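The IEEE 754 floating-point standard is implemented on most computers as a way of representing non-integer numbers. The original 1985 edition defined single and double precision; the 2008 revision added half and quadruple precision and introduced the binaryNN names. This leaves four binary types: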
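half (binary16): 5 exponent bits, 10 significand bits
single (binary32): 8 exponent bits, 23 significand bits
double (binary64): 11 exponent bits, 52 significand bits
quadruple (binary128): 15 exponent bits, 112 significand bits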
All of these work basically the same way, but with different sizes. Each one is made of the same three pieces:
Sign bit: if this is set, the number is negative. Always one bit.
Exponent: the size varies by type. This field stores the power of 2 by which the significand is multiplied. It uses an offset (biased) format, so the significand is actually multiplied by 2^(exponent-offset). The offset is equal to (2^(numberOfBitsInExponent-1))-1, which gives 15 for half, 127 for single, 1023 for double and 16383 for quadruple.
Significand: this is stored with an implicit 1 bit before the start, so the value should be treated as the binary fraction 1.significandBits.
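The three pieces can be pulled apart with a few shifts and masks. The following is a minimal Python sketch, not a reference implementation: the names FORMATS, split_fields and decode_normal are made up for this illustration, the offset is taken as (2^(numberOfBitsInExponent-1))-1, and quadruple is left out because Python's struct module has no format code for it. It only handles ordinary (normal) numbers.

```python
import struct

# Hypothetical lookup table for this sketch:
# type name -> (exponent bits, significand bits, float code, integer code)
FORMATS = {
    "half": (5, 10, "<e", "<H"),
    "single": (8, 23, "<f", "<I"),
    "double": (11, 52, "<d", "<Q"),
}


def split_fields(value, kind="single"):
    """Return (sign, exponent, significand) as raw unsigned integers."""
    exp_bits, sig_bits, float_code, int_code = FORMATS[kind]
    # Reinterpret the float's bytes as one unsigned integer of the same width.
    raw = struct.unpack(int_code, struct.pack(float_code, value))[0]
    sign = raw >> (exp_bits + sig_bits)
    exponent = (raw >> sig_bits) & ((1 << exp_bits) - 1)
    significand = raw & ((1 << sig_bits) - 1)
    return sign, exponent, significand


def decode_normal(sign, exponent, significand, kind="single"):
    """Rebuild the value of a normal number from its three fields."""
    exp_bits, sig_bits, _, _ = FORMATS[kind]
    offset = (1 << (exp_bits - 1)) - 1        # 127 for single
    fraction = 1 + significand / (1 << sig_bits)  # implicit leading 1 bit
    return (-1) ** sign * fraction * 2.0 ** (exponent - offset)
```

For example, -6.5 is -1.625 * 2^2, so as a single it splits into sign 1, exponent 129 (2 + 127) and significand 0x500000 (0.625 * 2^23).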
Exceptions
The problem with what has been outlined above is that there is no way to store 0, infinity or NaN. The special cases are outlined below:
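exponent=0:
if the significand is 0, then 0 or -0, depending on the sign bit
if the significand is not 0, then a subnormal (denormal) number: there is no implicit 1 bit, so the value is 0.significandBits multiplied by 2^(1-offset)
exponent=all ones (0x1f for half, 0xff for single, 0x7ff for double or 0x7fff for quadruple):
if the significand is 0, then infinity or -infinity, depending on the sign bit
if the significand is not 0, then NaN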
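The special cases can be checked by looking at the exponent field first and the significand second. Below is a rough Python sketch for single precision only; classify_single is a hypothetical helper name, not part of any library. It also reports one case beyond 0, infinity and NaN: an exponent of 0 with a non-zero significand, which IEEE 754 treats as a subnormal number (no implicit 1 bit).

```python
import struct


def classify_single(value):
    """Return (is_negative, category) for a value stored as binary32."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    raw = struct.unpack("<I", struct.pack("<f", value))[0]
    is_negative = bool(raw >> 31)
    exponent = (raw >> 23) & 0xFF
    significand = raw & 0x7FFFFF

    if exponent == 0:
        category = "zero" if significand == 0 else "subnormal"
    elif exponent == 0xFF:
        category = "infinity" if significand == 0 else "NaN"
    else:
        category = "normal"
    return is_negative, category
```

With this sketch, classify_single(-0.0) gives (True, "zero") and classify_single(float("inf")) gives (False, "infinity"). The same tests apply to the other types once the field widths and the all-ones exponent value are swapped in.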
More information can be found at:
http://en.wikipedia.org/wiki/Half_precision_floating-point_format
http://en.wikipedia.org/wiki/Single_precision_floating-point_format
http://en.wikipedia.org/wiki/Double_precision_floating-point_format
http://en.wikipedia.org/wiki/Quadruple_precision_floating-point_format