"If people wanted correct answers then they wouldn't use a computer"
-spoken by the author of this w/u on more than one occasion
Floating point is all about wrong answers and how to avoid and/or minimize them.
The primary source of errors
in a (properly specified, designed and implemented) floating point system isn't the floating point arithmetic itself but rather the simple fact that the results of the floating point arithmetic must be represented (i.e. stored) in a finite amount of space1
Given a pair of real numbers represented exactly in some floating point number format, it isn't all that difficult to compute the exact sum, difference, product, quotient or remainder of the pair.
The result of such a computation might require more space to represent it exactly than either of the two original real numbers required.
The problem is that preserving perfect accuracy throughout even a modest sized and relatively simple computation can quickly require truly vast amounts of space.
Things get even uglier when other operations are performed.
For example, the square root of the vast majority of real numbers is an irrational number (i.e. a number which cannot be represented by the ratio of two integers and whose numeric representation is an infinite sequences of digits which never repeats itself).
Computing the square root of any given (positive) real number to an arbitrary number of digits isn't all that difficult.
The problem is that the exact value often cannot be exactly represented in a fixed or even finite amount of space.
This w/u discusses some of the issues surrounding floating point number representation
The discussion begins with a look at real numbers . . .
Any real number can be represented by four values as follows:
(-1)sign * significand * bexponent
is a binary digit (i.e. 0 or 1) indicating the sign of the number.
or radix b
is a somewhat arbitrary constant ("tradition" says that it should probably be 10).
is selected to ensure that the significand
is within a suitable range (e.g. one non-zero digit before the decimal except in the case where the significand is zero).
Floating point number representation systems are all based on the observation that by selecting an appropriate base and placing appropriate limits on the sizes of the exponent and significand, it becomes possible to at least approximately represent any real number within a useful range of values and with a useful degree of precision.
The key to coming up with a "useful" floating point number representation system is selecting the right base and placing the appropriate limits on the sizes of the exponent and significand.
A "useful" system for representing floating point numbers must be able to:
- represent an interesting range of real numbers (e.g. values between -10-100 and 10+100)
- distinguish between relatively nearby real numbers (e.g. able to distinguish between two real numbers which differ in the 10th significant decimal digit)
- represent interesting numbers in a modest amount of space
Experience (i.e. IEEE 754 and IEEE 854) has shown that being able to represent certain special values, like positive and negative infinity, negative zero and NaNs can be quite useful.
I hesitate to suggest that such an ability is essential for the simple reason that there have been quite successful floating point number representation systems which cannot represent such values2.
There is at least one other characteristic which a useful floating point system should probably possess:
If a, b and c are three real numbers which can all be represented by a particular floating point number representation system and if there are no real numbers between a and b or between b and c which can also be represented by the system then the distance between a and b should be reasonably similar to the distance between b and c.
In other words, there shouldn't be sudden jumps between the successive real numbers which can be exactly represented by the system.
The argument against sudden jumps is really quite simple:
If it's acceptable for there to be a gap of a certain size between successive representable real numbers a and b then a similar gap size between successive representable real numbers b and c is surely also acceptable.
If the gap between a and b isn't reasonably similar to the gap between b and c then one of the two gap sizes is either unreasonably large or unnecessarily small.
In any given situation, the choice of base and the required range of representable values determines the amount of space which will be consumed by the exponent.
The total amount of space available to represent a single number minus the space consumed by the exponent and the one bit sign determines the space available for the significand.
Consequently, the choice of base is generally considered to be the most fundamental decision.
The author of this w/u is familiar with floating point number representation systems in which the base was one of 16, 10 and 2.
Let's consider each of these alternative base values in turn:
- a base of 16 has proven attractive as it allows a quite modest number of bits to represent a rather large range of potential values.
For example, an 8 bit signed value could represent values in the range 16-128 through to 16+127 (i.e. roughly 10-150 through to 10+150).
The problem is that each change in the exponent by one causes the significand to shift by four bits.
This causes sudden jumps by a factor of 16 in the size of the gap between successive representable numbers when successive numbers must be represented by different exponents.
- a base of 10 has proven attractive as it most closely resembles how humans think of numbers.
This is particularily useful when considering how fractional values are represented:
All of the powers of the decimal value 0.1 (e.g. 100, 10, 1, 0.1, 0.01, 0.001, etc) can be exactly represented if the base is 10 by setting the significand to 1 and using the appropriate exponent (e.g. 100 is 1*102, 0.1 is 1*10-1, etc).
This is a very attractive property when dealing with quantities which represent exact decimal fractional values as it means that such values can be represented exactly.
Of course, a base of 10 also suffers from the problem that the gap size between very nearby pairs of representable numbers is irregular.
A base 10 floating point system is often called a decimal floating point system3.
- a base of 2 has also proven attractive as it rather firmly solves the problem of sudden large changes in the gap size between nearby successive pairs of representable values (the change in gap size between the corrresponding pairs of three successive representable values is never more than a factor of 2).
The most important downsides of a base 2 representation are probably that decimal fractions can't be represented exactly (base 16 representations also have this problem) and exponents require more bits to represent values in a given range than exponents of larger bases would require to represent the same range.
Once the base has been selected, the key remaining questions are:
- what is the size of a floating point value in terms of the space that it occupies in memory?
i.e. how many bits should be used to represent values?
- of these bits, how many should be used for the exponent and how many should be used for the significand?
Practically all of the floating point representation systems which have been developed over the years support two different sizes - a 32 bit format for applications willing to trade off precision and possibly range for a compact representation
and a 64 bit format for applications less willing to make such a tradeoff.
The tradeoff between range and precision is fairly simple to express if not particularily easy to resolve to everyone's satisfaction -
assigning more bits to the exponent (i.e. increasing the range of representable values) reduces the number of bits available for the significand (i.e. precision of representable values).
The designers of the IEEE 754 binary floating point arithmetic standard selected a base of 2 because it avoided the sudden jumps in representable values suffered by larger bases.
They then took the position that the 32 bit format would set aside 1 bit for the sign, 8 bits for the exponent and 23 bits for the significand.
This format represents values in the range 2-128 (about 10-39) and 2+127 (about 10+39).
It also provides a precision of about 7 digits as 224 is about 16,000,000.
Their 64 bit format sets aside three additional bits for the exponent (i.e. 11 bits) while leaving 52 bits for the significand.
This provides a considerably increased range of roughly 10-300 to 10+300 and a rough doubling of precision to about 15 decimal digits.
See the IEEE 754
floating point standard for an example of a carefully specified, designed and typically well implemented floating point implementation in which all operations are required to produce exact results which are then rounded to make them fit in the space specified by the programmer.
The sole exception is that input/output conversions (i.e. conversion between strings of decimal digits and binary floating point values) is allowed to be just slightly less than perfectly accurate before rounding.
The IBM System/360
floating point system is one of many floating point number representation systems which had no way of representing infinities, signed zeros or NaNs.
Anyone prepared to suggest that the System/360 floating point system wasn't successful should spend a bit more time reading history of computing books.
The System/360 floating point system evolved over the years and became the System/370
Neither system could represent infinities, signed zeros or NaNs (strictly speaking, one could represent a negative zero but the system treated such a value as a regular zero in all respects).
The IEEE 854
standard describes a floating point system which can be implemented using base 10 or base 2.
Courtesy of Swap, I bring you the following example of how floating point arithmetic can get one into trouble:
Consider the following C program:
The output of this program on a computer which implements the IEEE Standard 754 for Binary Floating-Point Arithmetic is
printf( "( 0.3 - 0.2 ) - 0.1 == %g\n", ( 0.3 - 0.2 ) - 0.1 );
( 0.3 - 0.2 ) - 0.1 == -2.77556e-17
The reason for the (presumably) disconcerting result is that it is not possible to exactly represent 0.3 or 0.2 or even 0.1 in the IEEE 754 double precision format. To make matters worse, the IEEE 754 value of ( 0.3 - 0.2 ) is different than the IEEE 754 value of 0.1 with the result that when you subtract the IEEE 754 value of 0.1 from the IEEE 754 value of ( 0.3 - 0.2 ), you get a value which is almost but not quite zero!