Floating point/Lesson Four

Single Precision
In most computers, numbers are stored according to the IEEE 754 standard. This ensures that computer scientists can focus on analyzing the error, not on learning how their particular computer operates.

Single precision numbers are numbers stored according to these rules:

1) Numbers are converted to ±k × 2^m, where k is a binary number of the form 1.f and m is the exponent. The number k is always at least 1 and less than 2, so its first digit is always 1 and need not be stored; it is assumed.

2) The number k is rounded so that it contains only 24 bits.

3) The exponent m can range from -126 to +127. It is stored in the computer as m + 127.

4) The computer stores the following:

a) The sign bit (1 if negative, 0 if positive)

b) The biased exponent (8 bits, 00000001 to 11111110, signifying 1 to 254, but actually representing -126 to 127).

c) The actual number. This section is called the mantissa.  Because the number is assumed to have a 1 at the beginning, the 1 is not stored, but the next 23 digits are.  The 1 is called the hidden bit.

The binary point between the hidden bit and the fraction f is known as the radix point. See the IEEE Single Precision picture.
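The layout described in step 4 can be inspected directly. The following Python sketch (Python is used here because the Matlab commands in this lesson do not expose the raw bits; the helper name `fields` is our own) packs a number into its single-precision bit pattern with the standard `struct` module and splits out the three stored fields:

```python
import struct

def fields(x):
    """Split a number's single-precision bit pattern into the
    three stored fields: sign, biased exponent, fraction."""
    # Round x to single precision, then reread the same 4 bytes
    # as a raw 32-bit unsigned integer.
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign     = bits >> 31             # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits after the hidden 1
    return sign, exponent, fraction

# -6.25 = -1.5625 x 2^2 = -(1.1001)_2 x 2^2,
# so the biased exponent stored is 2 + 127 = 129.
sign, exponent, fraction = fields(-6.25)
print(sign, exponent - 127, f'{fraction:023b}')
```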

Single Precision Example
The number 1 would be stored as follows ('1' is equal to 1 × 2^0):

sign: 0 (positive)
exponent: 01111111 (biased exponent equal to 127 = 0 + 127)
mantissa: 00000000000000000000000 (the first '1' is not included; the next 23 digits are all zero)
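This worked example can be confirmed in Python by printing the raw 32-bit pattern of 1.0 (again via the standard `struct` module, used here as a stand-in for a bit-level view that Matlab's `single` does not provide):

```python
import struct

# Reinterpret the 4 bytes of single-precision 1.0 as a 32-bit integer.
(bits,) = struct.unpack('>I', struct.pack('>f', 1.0))

# Sign 0, exponent 01111111, mantissa all zeros:
print(f'{bits:032b}')  # 00111111100000000000000000000000
```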

Machine Epsilon, Single Precision
Machine epsilon, as a reminder, is the smallest possible number ε such that 1 + ε ≠ 1 on the machine. There are 23 bits available in the mantissa from the example above. Thus, as soon as 2^(-23) is added, another '1' will be stored; namely, the mantissa will read 00000000000000000000001. Thus, ε = 2^(-23).

Matlab Code Example
Here, we demonstrate that machine epsilon is indeed 2^(-23). The command 'single' forces Matlab to store a number in single precision.

>> single ( 1 - single ( 1 + 2^(-25) ) )

ans =

0

>> single ( 1 - single ( 1 + 2^(-24) ) )

ans =

0

>> single ( 1 - single ( 1 + 2^(-23) ) )

ans =

-1.1921e-007
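For readers without Matlab, the same experiment can be reproduced in Python by rounding every intermediate result to single precision with the standard `struct` module. The helper `single` below imitates Matlab's `single` and is our own naming:

```python
import struct

def single(x):
    """Round a Python float (a double) to single precision,
    imitating Matlab's single()."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

# Matches the Matlab session: 0.0 for n = 25 and 24,
# and -2^(-23) (about -1.19e-07) for n = 23.
for n in (25, 24, 23):
    print(n, single(1.0 - single(1.0 + 2.0**-n)))
```

Note that n = 24 also yields 0: the value 1 + 2^(-24) sits exactly halfway between two single-precision numbers, and round-to-nearest-even sends it back down to 1.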

Double Precision
Double precision operates in the same manner as single precision, except that more space is allocated to each number. Again, we have 1 sign bit, but now 11 bits for the exponent and 52 bits for the mantissa. The exponent is again biased, this time by 1023. Machine epsilon is 2^(-52).
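Python floats are already IEEE double precision, so the double-precision epsilon can be checked with no conversion helper at all; `sys.float_info.epsilon` is Python's name for the quantity Matlab calls `eps`:

```python
import sys

# 2^-53 lands exactly halfway and rounds back down to 1:
print(1.0 + 2.0**-53 == 1.0)              # True
# 2^-52 is the smallest increment that survives:
print(1.0 + 2.0**-52 == 1.0)              # False
# Python reports exactly this value as its epsilon:
print(sys.float_info.epsilon == 2.0**-52) # True
```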

Interesting Proof
Here, we prove that the relative error of storing a number in single precision (indeed, any precision) is at most machine epsilon divided by 2. Chopping (not rounding) instead gives a relative error of up to ε.

Denote by $x_-$ the machine number just below the actual number $x$, and by $x_+$ the machine number just above. In single precision, $x_- = (0.1 b_1 b_2 b_3 \ldots b_{23})_2 \times 2^k$. Additionally, $x_+ = [(0.1 b_1 b_2 b_3 \ldots b_{23})_2 + 2^{-24}] \times 2^k$.

Assume without loss of generality that $x$ is closer to $x_-$. Then, $|x - x_-| \le \frac{1}{2} |x_+ - x_-| = 2^{-25+k}$.

Then, $$ \left| \frac{x - x_{-}}{x} \right| \le \frac{2^{-25+k}}{(0.1 b_1 b_2 b_3 \ldots)_2 \times 2^k} \le \frac{2^{-25+k}}{\frac{1}{2} \times 2^k} = 2^{-24} = \frac{1}{2} \epsilon. $$
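The ε/2 bound can also be checked empirically. This Python sketch rounds a batch of random numbers to single precision (via the standard `struct` module) and verifies that the relative error never exceeds 2^(-24); the sampling range and seed are arbitrary choices:

```python
import random
import struct

def single(x):
    """Round a double to the nearest single-precision number."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

random.seed(0)
eps = 2.0**-23  # single-precision machine epsilon

for _ in range(10000):
    x = random.uniform(1e-3, 1e3)
    rel = abs(x - single(x)) / abs(x)
    # Relative rounding error never exceeds eps/2 = 2^-24.
    assert rel <= eps / 2

print('bound holds')
```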

Special Numbers
The 0 and 255 exponent fields, together with the sign bit and mantissa, represent certain special numbers (from IEEE 754-1985):

a) Exponent 00000000, mantissa zero: ±0 (the sign bit distinguishes +0 from -0).

b) Exponent 00000000, mantissa nonzero: denormalized (subnormal) numbers.

c) Exponent 11111111, mantissa zero: ±∞.

d) Exponent 11111111, mantissa nonzero: NaN (not a number).
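These reserved bit patterns can be seen with the same `struct` trick used earlier. The helper `bits` below (our own naming) prints the sign, exponent, and mantissa fields of each special value:

```python
import struct

def bits(x):
    """32-bit single-precision pattern, split as sign | exponent | mantissa."""
    (b,) = struct.unpack('>I', struct.pack('>f', x))
    s = f'{b:032b}'
    return f'{s[0]} {s[1:9]} {s[9:]}'

print(bits(0.0))           # exponent 00000000, mantissa zero: +0
print(bits(-0.0))          # same, but with the sign bit set: -0
print(bits(float('inf')))  # exponent 11111111, mantissa zero: +infinity
print(bits(float('nan')))  # exponent 11111111, mantissa nonzero: NaN
print(bits(1e-45))         # exponent 00000000, mantissa nonzero: denormalized
```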

Homework
1) What is the difference between machine epsilon and realmin?

2) In single precision, what will the following computations yield?

1 + ε

1 + realmin

realmin + ε

(2 + ε) - 1

(-1 + ε) + 2

3) Find realmax in single precision, both by hand and by using Matlab.