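The IEEE 754 floating-point standard is implemented on most computers as a way of representing non-integer numbers. The original IEEE 754-1985 defined only the single and double formats; the 2008 revision added half and quadruple, giving the 4 binary types covered here: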

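half (binary16): 5 exponent bits, 10 significand bits

single (binary32): 8 exponent bits, 23 significand bits

double (binary64): 11 exponent bits, 52 significand bits

quadruple (binary128): 15 exponent bits, 112 significand bits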
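To get a feel for how the sizes differ in practice, here is a minimal Python sketch that stores the same value in three of the formats and reads it back. It assumes Python 3.6 or later (for the `e` half-precision code of the standard `struct` module); quadruple is left out because `struct` has no code for it. The names `FORMATS` and `round_trip` are made up for this illustration.

```python
import struct

# struct format characters for the binary formats Python can pack directly.
# binary128 has no struct code, so it is omitted from this sketch.
FORMATS = {"half": "e", "single": "f", "double": "d"}

def round_trip(value, code):
    """Store value in the given format and read it back as a Python float."""
    return struct.unpack("<" + code, struct.pack("<" + code, value))[0]

for name, code in FORMATS.items():
    total_bits = struct.calcsize("<" + code) * 8
    print(name, total_bits, repr(round_trip(0.1, code)))
```

The wider the format, the closer the stored value gets to 0.1; in half precision, for instance, it comes back as 0.0999755859375.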

All of these work basically the same way, differing only in size. Each is made of the same three pieces:

Sign bit: if this is set, the number is negative. It is always one bit.

Exponent: the size varies, but this stores the power of 2 to multiply the significand by. It uses an offset (biased) format, so the significand is actually multiplied by 2^(exponent-offset). The offset is equal to (2^(numberOfBitsInExponent-1))-1, for example 127 for single and 1023 for double.

Significand: this is stored with an implicit 1 bit before the start, so the value should be treated as the binary fraction 1.significandBits.
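The three pieces can be pulled out of a real number with a bit of shifting and masking. The following is a minimal sketch for double (binary64) only, assuming 11 exponent bits, 52 stored significand bits and an offset of 2^(11-1)-1 = 1023; the helper names `decode_fields` and `rebuild_normal` are hypothetical and chosen just for this example. It handles ordinary (normal) numbers, not the special cases.

```python
import struct

EXPONENT_BITS = 11      # binary64
SIGNIFICAND_BITS = 52   # stored bits, excluding the implicit leading 1
BIAS = 2 ** (EXPONENT_BITS - 1) - 1   # offset: 1023 for double

def decode_fields(x):
    """Return the raw (sign, exponent, significand) bit fields of a double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> (EXPONENT_BITS + SIGNIFICAND_BITS)
    exponent = (bits >> SIGNIFICAND_BITS) & ((1 << EXPONENT_BITS) - 1)
    significand = bits & ((1 << SIGNIFICAND_BITS) - 1)
    return sign, exponent, significand

def rebuild_normal(sign, exponent, significand):
    """Recompute the value of a normal number from its three fields."""
    fraction = 1 + significand / 2 ** SIGNIFICAND_BITS   # 1.significandBits
    return (-1) ** sign * fraction * 2.0 ** (exponent - BIAS)

sign, exponent, significand = decode_fields(-6.5)
print(sign, exponent - BIAS, hex(significand))
print(rebuild_normal(sign, exponent, significand))
```

For -6.5, which is -1.625 × 2^2, the sign bit is 1, the stored exponent is 1025 (2 + 1023), and the significand field holds the .625 part.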

### Exceptions

The problem with what has been outlined above is that there is no way to store 0, infinity or NaN. The special cases are outlined below:

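exponent=0:

if the significand is 0, then 0 or -0, depending on the sign bit

if the significand is not 0, then a subnormal (denormal) number: there is no implicit 1 bit, so the value is 0.significandBits multiplied by 2^(1-offset)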

exponent=all ones (0x1f for half, 0xff for single, 0x7ff for double or 0x7fff for quadruple):

if the significand is 0, then infinity

if the significand is not 0, then NaN
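These rules translate almost directly into code. Below is a minimal sketch for double (binary64), assuming the 11-bit exponent and 52-bit significand layout; `classify` is a hypothetical helper name. It also reports subnormal numbers, which are what an all-zero exponent with a non-zero significand encodes.

```python
import struct

def classify(x):
    """Name the IEEE 754 category of a Python float (binary64)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    exponent = (bits >> 52) & 0x7FF          # 11 exponent bits
    significand = bits & ((1 << 52) - 1)     # 52 stored significand bits
    if exponent == 0:
        return "zero" if significand == 0 else "subnormal"
    if exponent == 0x7FF:                    # exponent field is all ones
        return "infinity" if significand == 0 else "NaN"
    return "normal"

for value in (0.0, -0.0, 5e-324, 1.0, float("inf"), float("nan")):
    print(repr(value), classify(value))
```

Note that the sign bit plays no part in the classification: both 0.0 and -0.0 come out as zero, and both infinities as infinity.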

More information can be found at:

http://en.wikipedia.org/wiki/Half_precision_floating-point_format

http://en.wikipedia.org/wiki/Single_precision_floating-point_format

http://en.wikipedia.org/wiki/Double_precision_floating-point_format

http://en.wikipedia.org/wiki/Quadruple_precision_floating-point_format