Commit | Line | Data |
---|---|---|
f65fd747 | 1 | @node Floating-Point Limits |
28f540f4 RM |
2 | @chapter Floating-Point Limits |
3 | @pindex <float.h> | |
4 | @cindex floating-point number representation | |
5 | @cindex representation of floating-point numbers | |
6 | ||
7 | Because floating-point numbers are represented internally as approximate | |
8 | quantities, algorithms for manipulating floating-point data often need | |
9 | to be parameterized in terms of the accuracy of the representation. | |
10 | Some of the functions in the C library itself need this information; for | |
11 | example, the algorithms for printing and reading floating-point numbers | |
12 | (@pxref{I/O on Streams}) and for calculating trigonometric and | |
13 | irrational functions (@pxref{Mathematics}) use information about the | |
14 | underlying floating-point representation to avoid round-off error and | |
15 | loss of accuracy. User programs that implement numerical analysis | |
16 | techniques also often need to be parameterized in this way in order to | |
17 | minimize or compute error bounds. | |
18 | ||
19 | The specific representation of floating-point numbers varies from | |
20 | machine to machine. The GNU C Library defines a set of parameters which | |
21 | characterize each of the supported floating-point representations on a | |
22 | particular system. | |
23 | ||
24 | @menu | |
25 | * Floating-Point Representation:: Definitions of terminology. | |
26 | * Floating-Point Parameters:: Descriptions of the library facilities. | |
27 | * IEEE Floating Point:: An example of a common representation. | |
28 | @end menu | |
29 | ||
30 | @node Floating-Point Representation | |
31 | @section Floating-Point Representation | |
32 | ||
33 | This section introduces the terminology used to characterize the | |
34 | representation of floating-point numbers. | |
35 | ||
36 | You are probably already familiar with most of these concepts in terms | |
37 | of scientific or exponential notation for floating-point numbers. For | |
38 | example, the number @code{123456.0} could be expressed in exponential | |
39 | notation as @code{1.23456e+05}, a shorthand notation indicating that the | |
40 | mantissa @code{1.23456} is multiplied by the base @code{10} raised to the | |
41 | power @code{5}. | |
42 | ||
43 | More formally, the internal representation of a floating-point number | |
44 | can be characterized in terms of the following parameters: | |
45 | ||
46 | @itemize @bullet | |
47 | @item | |
48 | The @dfn{sign} is either @code{-1} or @code{1}. | |
49 | @cindex sign (of floating-point number) | |
50 | ||
51 | @item | |
52 | The @dfn{base} or @dfn{radix} for exponentiation; an integer greater | |
53 | than @code{1}. This is a constant for the particular representation. | |
54 | @cindex base (of floating-point number) | |
55 | @cindex radix (of floating-point number) | |
56 | ||
57 | @item | |
58 | The @dfn{exponent} to which the base is raised. The upper and lower | |
59 | bounds of the exponent value are constants for the particular | |
60 | representation. | |
61 | @cindex exponent (of floating-point number) | |
62 | ||
63 | Sometimes, in the actual bits representing the floating-point number, | |
64 | the exponent is @dfn{biased} by adding a constant to it, to make it | |
65 | always be represented as an unsigned quantity. This is only important | |
66 | if you have some reason to pick apart the bit fields making up the | |
67 | floating-point number by hand, which is something for which the GNU | |
68 | library provides no support. So this is ignored in the discussion that | |
69 | follows. | |
70 | @cindex bias, in exponent (of floating-point number) | |
71 | ||
72 | @item | |
73 | The value of the @dfn{mantissa} or @dfn{significand}, which is an | |
74 | unsigned quantity. | |
75 | @cindex mantissa (of floating-point number) | |
76 | @cindex significand (of floating-point number) | |
77 | ||
f65fd747 | 78 | @item |
28f540f4 RM |
79 | The @dfn{precision} of the mantissa. If the base of the representation |
80 | is @var{b}, then the precision is the number of base-@var{b} digits in | |
81 | the mantissa. This is a constant for the particular representation. | |
82 | ||
83 | Many floating-point representations have an implicit @dfn{hidden bit} in | |
84 | the mantissa. Any such hidden bits are counted in the precision. | |
85 | Again, the GNU library provides no facilities for dealing with such low-level | |
86 | aspects of the representation. | |
87 | @cindex precision (of floating-point number) | |
88 | @cindex hidden bit, in mantissa (of floating-point number) | |
89 | @end itemize | |
90 | ||
91 | The mantissa of a floating-point number actually represents an implicit | |
92 | fraction whose denominator is the base raised to the power of the | |
93 | precision. Since the largest representable mantissa is one less than | |
94 | this denominator, the value of the fraction is always strictly less than | |
95 | @code{1}. The mathematical value of a floating-point number is then the | |
96 | product of this fraction, the sign, and the base raised to the exponent. | |
97 | ||
98 | If the floating-point number is @dfn{normalized}, the mantissa is also | |
99 | greater than or equal to the base raised to the power of one less | |
100 | than the precision (unless the number represents a floating-point zero, | |
101 | in which case the mantissa is zero). The fractional quantity is | |
102 | therefore greater than or equal to @code{1/@var{b}}, where @var{b} is | |
103 | the base. | |
104 | @cindex normalized floating-point number | |
105 | ||
106 | @node Floating-Point Parameters | |
107 | @section Floating-Point Parameters | |
108 | ||
109 | @strong{Incomplete:} This section needs some more concrete examples | |
110 | of what these parameters mean and how to use them in a program. | |
111 | ||
112 | These macro definitions can be accessed by including the header file | |
113 | @file{<float.h>} in your program. | |
114 | ||
115 | Macro names starting with @samp{FLT_} refer to the @code{float} type, | |
116 | while names beginning with @samp{DBL_} refer to the @code{double} type | |
117 | and names beginning with @samp{LDBL_} refer to the @code{long double} | |
118 | type. (In implementations that do not support @code{long double} as | |
119 | a distinct data type, the values for those constants are the same | |
120 | as the corresponding constants for the @code{double} type.)@refill | |
121 | ||
122 | Note that only @code{FLT_RADIX} is guaranteed to be a constant | |
123 | expression, so the other macros listed here cannot be reliably used in | |
124 | places that require constant expressions, such as @samp{#if} | |
125 | preprocessing directives and array size specifications. | |
126 | ||
f65fd747 | 127 | Although the @w{ISO C} standard specifies minimum and maximum values for |
28f540f4 RM |
128 | most of these parameters, the GNU C implementation uses whatever |
129 | floating-point representations are supported by the underlying hardware. | |
f65fd747 | 130 | So whether GNU C actually satisfies the @w{ISO C} requirements depends on |
28f540f4 RM |
131 | what machine it is running on. |
132 | ||
133 | @comment float.h | |
f65fd747 | 134 | @comment ISO |
28f540f4 RM |
135 | @defvr Macro FLT_ROUNDS |
136 | This value characterizes the rounding mode for floating-point addition. | |
137 | The following values indicate standard rounding modes: | |
138 | ||
139 | @table @code | |
140 | @item -1 | |
141 | The mode is indeterminable. | |
142 | @item 0 | |
143 | Rounding is towards zero. | |
144 | @item 1 | |
145 | Rounding is to the nearest number. | |
146 | @item 2 | |
147 | Rounding is towards positive infinity. | |
148 | @item 3 | |
149 | Rounding is towards negative infinity. | |
150 | @end table | |
151 | ||
152 | @noindent | |
153 | Any other value represents a machine-dependent nonstandard rounding | |
154 | mode. | |
155 | @end defvr | |
156 | ||
157 | @comment float.h | |
f65fd747 | 158 | @comment ISO |
28f540f4 RM |
159 | @defvr Macro FLT_RADIX |
160 | This is the value of the base, or radix, of the exponent representation. | |
161 | This is guaranteed to be a constant expression, unlike the other macros | |
162 | described in this section. | |
163 | @end defvr | |
164 | ||
165 | @comment float.h | |
f65fd747 | 166 | @comment ISO |
28f540f4 RM |
167 | @defvr Macro FLT_MANT_DIG |
168 | This is the number of base-@code{FLT_RADIX} digits in the floating-point | |
169 | mantissa for the @code{float} data type. | |
170 | @end defvr | |
171 | ||
172 | @comment float.h | |
f65fd747 | 173 | @comment ISO |
28f540f4 RM |
174 | @defvr Macro DBL_MANT_DIG |
175 | This is the number of base-@code{FLT_RADIX} digits in the floating-point | |
176 | mantissa for the @code{double} data type. | |
177 | @end defvr | |
178 | ||
179 | @comment float.h | |
f65fd747 | 180 | @comment ISO |
28f540f4 RM |
181 | @defvr Macro LDBL_MANT_DIG |
182 | This is the number of base-@code{FLT_RADIX} digits in the floating-point | |
183 | mantissa for the @code{long double} data type. | |
184 | @end defvr | |
185 | ||
186 | @comment float.h | |
f65fd747 | 187 | @comment ISO |
28f540f4 RM |
188 | @defvr Macro FLT_DIG |
189 | This is the number of decimal digits of precision for the @code{float} | |
190 | data type. Technically, if @var{p} and @var{b} are the precision and | |
191 | base (respectively) for the representation, then the decimal precision | |
192 | @var{q} is the maximum number of decimal digits such that any floating | |
193 | point number with @var{q} base 10 digits can be rounded to a floating | |
194 | point number with @var{p} base @var{b} digits and back again, without | |
195 | change to the @var{q} decimal digits. | |
196 | ||
197 | The value of this macro is guaranteed to be at least @code{6}. | |
198 | @end defvr | |
199 | ||
200 | @comment float.h | |
f65fd747 | 201 | @comment ISO |
28f540f4 RM |
202 | @defvr Macro DBL_DIG |
203 | This is similar to @code{FLT_DIG}, but is for the @code{double} data | |
204 | type. The value of this macro is guaranteed to be at least @code{10}. | |
205 | @end defvr | |
206 | ||
207 | @comment float.h | |
f65fd747 | 208 | @comment ISO |
28f540f4 RM |
209 | @defvr Macro LDBL_DIG |
210 | This is similar to @code{FLT_DIG}, but is for the @code{long double} | |
211 | data type. The value of this macro is guaranteed to be at least | |
212 | @code{10}. | |
213 | @end defvr | |
214 | ||
215 | @comment float.h | |
f65fd747 | 216 | @comment ISO |
28f540f4 RM |
217 | @defvr Macro FLT_MIN_EXP |
218 | This is the minimum negative integer such that the mathematical value | |
219 | @code{FLT_RADIX} raised to one less than this power can be represented as a | |
220 | normalized floating-point number of type @code{float}. In terms of the | |
221 | actual implementation, this is just the smallest value that can be | |
222 | represented in the exponent field of the number. | |
223 | @end defvr | |
224 | ||
225 | @comment float.h | |
f65fd747 | 226 | @comment ISO |
28f540f4 RM |
227 | @defvr Macro DBL_MIN_EXP |
228 | This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data | |
229 | type. | |
230 | @end defvr | |
231 | ||
232 | @comment float.h | |
f65fd747 | 233 | @comment ISO |
28f540f4 RM |
234 | @defvr Macro LDBL_MIN_EXP |
235 | This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double} | |
236 | data type. | |
237 | @end defvr | |
238 | ||
239 | @comment float.h | |
f65fd747 | 240 | @comment ISO |
28f540f4 RM |
241 | @defvr Macro FLT_MIN_10_EXP |
242 | This is the minimum negative integer such that the mathematical value | |
243 | @code{10} raised to this power can be represented as a | |
244 | normalized floating-point number of type @code{float}. This is | |
245 | guaranteed to be no greater than @code{-37}. | |
246 | @end defvr | |
247 | ||
248 | @comment float.h | |
f65fd747 | 249 | @comment ISO |
28f540f4 RM |
250 | @defvr Macro DBL_MIN_10_EXP |
251 | This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double} | |
252 | data type. | |
253 | @end defvr | |
254 | ||
255 | @comment float.h | |
f65fd747 | 256 | @comment ISO |
28f540f4 RM |
257 | @defvr Macro LDBL_MIN_10_EXP |
258 | This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{long | |
259 | double} data type. | |
260 | @end defvr | |
261 | ||
262 | ||
263 | ||
264 | @comment float.h | |
f65fd747 | 265 | @comment ISO |
28f540f4 RM |
266 | @defvr Macro FLT_MAX_EXP |
267 | This is the maximum integer such that the mathematical value | |
268 | @code{FLT_RADIX} raised to one less than this power can be represented as a | |
269 | floating-point number of type @code{float}. In terms of the actual | |
270 | implementation, this is just the largest value that can be represented | |
271 | in the exponent field of the number. | |
272 | @end defvr | |
273 | ||
274 | @comment float.h | |
f65fd747 | 275 | @comment ISO |
28f540f4 RM |
276 | @defvr Macro DBL_MAX_EXP |
277 | This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data | |
278 | type. | |
279 | @end defvr | |
280 | ||
281 | @comment float.h | |
f65fd747 | 282 | @comment ISO |
28f540f4 RM |
283 | @defvr Macro LDBL_MAX_EXP |
284 | This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double} | |
285 | data type. | |
286 | @end defvr | |
287 | ||
288 | @comment float.h | |
f65fd747 | 289 | @comment ISO |
28f540f4 RM |
290 | @defvr Macro FLT_MAX_10_EXP |
291 | This is the maximum integer such that the mathematical value | |
292 | @code{10} raised to this power can be represented as a | |
293 | normalized floating-point number of type @code{float}. This is | |
294 | guaranteed to be at least @code{37}. | |
295 | @end defvr | |
296 | ||
297 | @comment float.h | |
f65fd747 | 298 | @comment ISO |
28f540f4 RM |
299 | @defvr Macro DBL_MAX_10_EXP |
300 | This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double} | |
301 | data type. | |
302 | @end defvr | |
303 | ||
304 | @comment float.h | |
f65fd747 | 305 | @comment ISO |
28f540f4 RM |
306 | @defvr Macro LDBL_MAX_10_EXP |
307 | This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{long | |
308 | double} data type. | |
309 | @end defvr | |
310 | ||
311 | ||
312 | @comment float.h | |
f65fd747 | 313 | @comment ISO |
28f540f4 RM |
314 | @defvr Macro FLT_MAX |
315 | The value of this macro is the maximum representable floating-point | |
316 | number of type @code{float}, and is guaranteed to be at least | |
317 | @code{1E+37}. | |
318 | @end defvr | |
319 | ||
320 | @comment float.h | |
f65fd747 | 321 | @comment ISO |
28f540f4 RM |
322 | @defvr Macro DBL_MAX |
323 | The value of this macro is the maximum representable floating-point | |
324 | number of type @code{double}, and is guaranteed to be at least | |
325 | @code{1E+37}. | |
326 | @end defvr | |
327 | ||
328 | @comment float.h | |
f65fd747 | 329 | @comment ISO |
28f540f4 RM |
330 | @defvr Macro LDBL_MAX |
331 | The value of this macro is the maximum representable floating-point | |
332 | number of type @code{long double}, and is guaranteed to be at least | |
333 | @code{1E+37}. | |
334 | @end defvr | |
335 | ||
336 | ||
337 | @comment float.h | |
f65fd747 | 338 | @comment ISO |
28f540f4 RM |
339 | @defvr Macro FLT_MIN |
340 | The value of this macro is the minimum normalized positive | |
341 | floating-point number that is representable by type @code{float}, and is | |
342 | guaranteed to be no more than @code{1E-37}. | |
343 | @end defvr | |
344 | ||
345 | @comment float.h | |
f65fd747 | 346 | @comment ISO |
28f540f4 RM |
347 | @defvr Macro DBL_MIN |
348 | The value of this macro is the minimum normalized positive | |
349 | floating-point number that is representable by type @code{double}, and | |
350 | is guaranteed to be no more than @code{1E-37}. | |
351 | @end defvr | |
352 | ||
353 | @comment float.h | |
f65fd747 | 354 | @comment ISO |
28f540f4 RM |
355 | @defvr Macro LDBL_MIN |
356 | The value of this macro is the minimum normalized positive | |
357 | floating-point number that is representable by type @code{long double}, | |
358 | and is guaranteed to be no more than @code{1E-37}. | |
359 | @end defvr | |
360 | ||
361 | ||
362 | @comment float.h | |
f65fd747 | 363 | @comment ISO |
28f540f4 RM |
364 | @defvr Macro FLT_EPSILON |
365 | This is the difference between @code{1.0} and the smallest value of type | |
366 | @code{float} that is greater than @code{1.0}. It is guaranteed to | |
367 | be no greater than @code{1E-5}. | |
368 | @end defvr | |
369 | ||
370 | @comment float.h | |
f65fd747 | 371 | @comment ISO |
28f540f4 RM |
372 | @defvr Macro DBL_EPSILON |
373 | This is similar to @code{FLT_EPSILON}, but is for the @code{double} | |
374 | type. It is guaranteed to be no greater than @code{1E-9}. | |
375 | @end defvr | |
376 | ||
377 | @comment float.h | |
f65fd747 | 378 | @comment ISO |
28f540f4 RM |
379 | @defvr Macro LDBL_EPSILON |
380 | This is similar to @code{FLT_EPSILON}, but is for the @code{long double} | |
381 | type. It is guaranteed to be no greater than @code{1E-9}. | |
382 | @end defvr | |
383 | ||
384 | ||
385 | ||
386 | @node IEEE Floating Point | |
387 | @section IEEE Floating Point | |
388 | ||
389 | Here is an example showing how these parameters work for a common | |
390 | floating point representation, specified by the @cite{IEEE Standard for | |
f65fd747 UD |
391 | Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985 or ANSI/IEEE |
392 | Std 854-1987)}. | |
28f540f4 RM |
393 | |
394 | The IEEE single-precision float representation uses a base of 2. There | |
395 | is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total | |
396 | precision is 24 base-2 digits), and an 8-bit exponent that can represent | |
397 | values in the range -125 to 128, inclusive. | |
398 | ||
399 | So, for an implementation that uses this representation for the | |
400 | @code{float} data type, appropriate values for the corresponding | |
401 | parameters are: | |
402 | ||
403 | @example | |
404 | FLT_RADIX 2 | |
405 | FLT_MANT_DIG 24 | |
406 | FLT_DIG 6 | |
407 | FLT_MIN_EXP -125 | |
408 | FLT_MIN_10_EXP -37 | |
409 | FLT_MAX_EXP 128 | |
410 | FLT_MAX_10_EXP +38 | |
411 | FLT_MIN 1.17549435E-38F | |
412 | FLT_MAX 3.40282347E+38F | |
413 | FLT_EPSILON 1.19209290E-07F | |
414 | @end example |