Integer number formats cannot store fractional values.
We can get around this using a fixed point format, but this
severely limits both the range and the number of fractional
digits. A 64-bit fixed point system with 32 whole number
bits and 32 fractional bits has a maximum range of
2^{32} - 1 even when we don't need
the fractional bits. A system that utilizes all the bits
where they are needed would generally be a better option
for approximating real numbers.

Floating point gets around the limitations of fixed point by using a format similar to scientific notation.

3.52 * 10^{3} = 3520

A scientific notation number consists of a
*mantissa* (3.52 in the example above)
a *radix* (always 10), and an
*exponent* (3 in the example above).
Hence, the general format of a scientific notation value is:

mantissa * radix^{exponent}

The *normalized* form always has a mantissa
greater than or equal to 1.0, and strictly less than 10.0.
Written
mathematically, the mantissa is in the set [1.0,10.0). As you
may recall from algebra, [] encloses a set including the
extreme values and () encloses a set excluding them. We can
denormalize the value and express it in many other ways, such
as 35.2 * 10^{2}, or 0.00352 *
10^{6}. For each position we shift
the digits of the mantissa relative to the decimal point, we are
multiplying or dividing the mantissa by a factor of 10.
To compensate for this, we simply increase or decrease the
exponent. Adding 1 to the exponent multiplies
10^{exponent} by 10, for example;
this compensates for shifting the mantissa digits one position
to the right, which divides the mantissa by 10.

Denormalizing is necessary when adding scientific notation values with different exponents:

```
    3.52 * 10^3              0.0352 * 10^5
  + 1.97 * 10^5    ==>     + 1.97   * 10^5
                           ---------------
                             2.0052 * 10^5
```

Adjusting the mantissa and exponent is also sometimes necessary to normalize results. For example, if we had chosen to denormalize the other value above, we would see the following:

```
    3.52 * 10^3                3.52 * 10^3
  + 1.97 * 10^5    ==>     + 197.0  * 10^3
                           ---------------
                           200.52  * 10^3    Not normalized
                           2.0052  * 10^5    Normalized
```

For this reason, it's best to denormalize the number with the smaller exponent. This way, the result of the addition is likely to be already normalized.

A binary floating system is basically the same, but
stores a signed binary mantissa and
a signed binary exponent, and usually uses a radix of 2.
Using a radix of 2
(or any power of 2) allows us to normalize and denormalize by
simply shifting the binary digits in the mantissa and adding
or subtracting from the exponent, just as we do with scientific
notation.
Shifting binary digits
in the mantissa n bits to the left or right multiplies or divides
the mantissa by 2^{n}, just as shifting
decimal digits multiplies or divides by 10.

00010_{2} * 2^{3}
= 01000_{2} * 2^{1}.

Standard floating point formats are defined by the IEEE. The IEEE formats are slightly more complex than necessary for understanding floating point in general, so we will start with a simpler example here.

Many people get confused about floating point because of their
inability to abstract it into a hierarchy of simple components.
If we solve one piece of the puzzle at a time
and then put them together, it's
no more difficult than fixed point or integers. It just
requires a simple extra step such as computing
`mantissa * radix^exponent`. Don't think about
the floating point format overall or the exponent while
converting the mantissa. Just understand the mantissa and
deal with it separately. This is a classic example of
*divide and conquer*, which is the
whole point of abstraction.

Suppose a 32-bit floating point format has a 24-bit two's complement mantissa, an 8-bit two's complement exponent, and a radix of 2. The general structure is:

mantissa * 2^{exponent}

The binary format is as follows, labeling bits starting with 0 on the right as we would with an unsigned integer:

```
 31                            8 7          0
+-------------------------------+-----------+
|            mantissa           | exponent  |
+-------------------------------+-----------+
```

We do not need to store the radix, since it never changes and can just be assumed by the hardware design.

What is the value of the following number?

000000000000000000010010 11111100

The mantissa is 000000000000000000010010, or +(2 + 16) = +18_{10}.
The exponent is 11111100 = -(00000011 + 1) = -00000100 = -4_{10}.
The value is therefore +18 * 2^{-4}, or +1.125_{10}.

What is the largest positive value we can represent in this system?

The largest positive value will consist of the largest positive mantissa and the largest positive exponent. Since we know the range of any two's complement system from the section called “Two's Complement”, this is easy to figure out.

The largest positive mantissa is 011111111111111111111111, which in two's complement is +2^{23} - 1 = +8388607. The largest exponent is 01111111, which in two's complement is +2^{7} - 1 = +127.

Hence, the largest positive value is +8388607 * 2^{+127} = 1.42 * 10^{45}.

Note that this is a *much* larger number than we can store in a 32-bit integer. But with 32 bits, there are exactly 2^{32} *different* patterns of 0s and 1s possible, so we cannot represent more than 2^{32} different numbers. The obvious catch, then, is that we cannot store every integer value up to 1.42 * 10^{45}.

So what is the second largest positive value? What is the difference between the largest and second largest? The second largest value consists of the second largest mantissa and the largest exponent, so +8388606 * 2^{+127}. This value is equal to the largest value minus 2^{127}. As we can see, the larger the exponent, the more spread out are the consecutive values that our system can represent.

What is the smallest positive value? For an integer system, this is always 1. To find the smallest positive value in the form mantissa * radix^{exponent}, we choose the smallest positive mantissa and the smallest *negative* exponent (the negative exponent with the largest magnitude).

Since the mantissa is an integer, the smallest positive mantissa is 1. Since the exponent is an 8-bit two's complement value, the smallest negative exponent is 10000000_{2}, or -2^{7} = -128.

Hence the smallest positive value is 1 * 2^{-128}, or 2.93873587706 * 10^{-39}. The next smallest positive value is 2 * 2^{-128}, which differs from it by only 2^{-128}. So we can see that consecutive values close to 0 are very close together.

What is the precision of the system?

The precision of any system is the maximum number of significant figures, and is determined by the mantissa since the exponent only adds leading or trailing 0s to the value (i.e. shifts the digits in the mantissa left or right). The precision is therefore 23 bits, which is the number of bits that contribute to the magnitude of a two's complement integer.

### Note

A 32-bit integer is significantly more precise than a 32-bit floating point value, since floating point sacrifices some bits of precision to increase range and allow for fractional values.

Represent -2.75 in this floating point system.

Convert the number to fixed point binary using the methods described in previous sections:

-2.75_{10} = -(10.11_{2})

Multiply by a radix^{exponent} factor equal to 1 so we have all the components:

-2.75_{10} = -(10.11_{2}) * 2^{0}

Shift the binary point to make the mantissa a whole number: -(1011_{2}). By moving the binary point two places to the right, we multiplied the mantissa by 2^{2}. We therefore must divide the other factor (2^{exponent}) by 2^{2} in order to preserve the overall value. This is just a matter of subtracting 2 from the exponent.

-(10.11_{2}) * 2^{0} = -(1011_{2}) * 2^{-2}

Convert the mantissa and exponent into the specified formats. The mantissa in this case is 24-bit two's complement and the exponent is 8-bit two's complement.

Mantissa: -(000000000000000000001011) = 111111111111111111110101

Exponent: -2_{10} = 11111110

Place the mantissa and exponent in their positions in the 32-bit floating point format:

Binary representation = 11111111111111111111010111111110

*Overflow* occurs when the result of
a floating point operation is larger than the largest
positive value, or smaller than the smallest negative value.
In other words, the magnitude is too large to represent.

*Underflow* occurs when the result of
a floating point operation is between the smallest possible
positive value and the largest possible negative value.
In other words, the magnitude is too small to represent.

The example 32-bit format above cannot represent values larger
than 1.42 * 10^{45} or smaller than
2.93873587706 * 10^{-39}.

One technique to avoid overflow and underflow is to alternate operations that increase and decrease intermediate results. Rather than do all the multiplications (by values greater than 1) first, which could cause overflow, or all the divisions first, which could cause underflow, we could alternate multiplications and divisions to moderate the results along the way. Techniques like these must often be used in scientific calculations.

Using 4-bit unsigned integers as a simpler example of the concept, if we try to compute 4 * 4 / 2, we will encounter an overflow because 4 * 4 is larger than the maximum value of 15 that we can represent in 4-bits. If, on the other hand, we compute it as 4 / 2 * 4, we get the correct result. The same principle applies to all systems with limited range.

Everything has a cost. The increased range and the ability to represent non-whole numbers is no exception.

There are only 2^{32} possible
patterns of 32 0s and 1s. Hence, there are at most
2^{32} unique numbers that
we can represent with 32 bits, regardless of the format.

So how is it that we can represent numbers up to
10^{45}?

Obviously, we must be sacrificing something in between. What floating point does for us is spread out the limited number of binary patterns we have available to cover a larger range of numbers. The larger the exponent, the larger the gap between consecutive numbers that we can accurately represent.

The gap between consecutive numbers that we can represent gets larger as the exponent grows.

The precision of a 32-bit floating point value is less than the precision of a 32-bit integer. By using 8 bits for the exponent, we sacrifice those 8 bits of precision. Hence, our example format has the same precision as a 24-bit signed integer system.

Arithmetic on floating point is several times slower than on integers. This is an inherent property of the more complex format.

Consider the process of adding two scientific notation values:

- Equalize the exponents
- Add the mantissas
- Normalize the result

Each of these operations takes roughly the same amount of time in a computer as a single integer addition. Since floating point is stored like scientific notation, it uses a similar process and we can expect floating point addition to take about three times as long as integer addition. In reality, a typical PC takes about 2.5 times as long to run a loop doing mostly floating point arithmetic as it does to do the same loop with integers.

Note that this applies only to operations that can be carried out using either a single integer instruction or a single floating point instruction. For example, suppose a program is running on a 32-bit computer, and the data may be outside the range of a 32-bit integer. In this case, multiple integer instructions will be necessary to process integer values of more than 32 bits, and much of the speed advantage of integers is lost.

If hardware has floating point support built-in, then common operations like floating point addition, subtraction, etc. can each be handled by a single instruction. If hardware doesn't have a floating point unit (common in microcontrollers), floating point operations must be handled by software routines. Hence, adding two floating point values will require dozens of integer instructions to complete instead of just one floating point instruction.

Most algorithms can be implemented using integers instead of floating point with a little thought. Use of floating point is often the result of sheer laziness. Don't use floating point just because it's intuitive. Think about whether it is really necessary. The answer is usually no.

**Example 15.8. Improving Speed with Integers**

The value of PI can be estimated using a Monte Carlo simulation by generating uniform random coordinates across a square "dart board" with an inscribed circle. Since the darts are distributed in a uniform random fashion, the probability of one landing in the circle is the area of the circle divided by the area of the square. This ratio is PI/4.

After generating many random points, we simply divide
the number within the circle by the total to get an
estimate of PI/4.
The intuitive approach is to use a 1 x 1 square and
generate random x and y values between 0 and 1.
However, this approach requires the use of floating point.
If we instead make the dimensions of the square
our maximum integer size
(e.g. 2^{32}-1) and generate
integer x and y values instead, we can eliminate
all floating point calculations from the program.

In fact, the integer implementation of this program is more than twice as fast as the floating point implementation. Full details can be found in the Research Computing User's Guide.

Use of floating point also results in more power consumption. This is not usually noticeable on a desktop PC, but can become a problem on large grids consisting of hundreds of PCs, since the power grid they are attached to may not be designed to provide for their maximum draw. It can also be a problem when running a mobile device on battery while doing intensive computations. Battery life while doing intensive floating point computations could be a small fraction of what it is while reading email, browsing the web, or editing a document.

Because floating point suffers from round-off error, it is
not safe to directly compare floating point values in a
program. For example, the following program will fail
due to the fact that 0.1 cannot be represented perfectly
with a limited number of digits. Instead of stopping at
1.0, `x` misses the value by a tiny amount,
and an infinite loop occurs.

```c
#include <stdio.h>
#include <sysexits.h>

int     main(int argc, char *argv[])
{
    double  x;

    for (x = 0.0; x != 1.0; x += 0.1)
        printf("%0.16f\n", x);
    return EX_OK;
}
```

Actual output:

```
0.0000000000000000
0.1000000000000000
0.2000000000000000
0.3000000000000000
0.4000000000000000
0.5000000000000000
0.6000000000000000
0.7000000000000000
0.7999999999999999
0.8999999999999999
0.9999999999999999
1.0999999999999999
1.2000000000000000
1.3000000000000000
1.4000000000000001
1.5000000000000002
...
```

The example floating point systems above are only for general
understanding and context, and are not used in real systems. Use of
integer mantissa formats suffers
from problems such as
multiple representations of the same value, e.g.
1 * 2^{2} = 2 * 2^{1}.

The IEEE floating point formats have become standard on most CPUs. They are highly optimized and contain special features, like representations for infinity and not-a-number (NaN).

The 32-bit IEEE format consists of a sign bit S (bit 31), an 8-bit exponent field E (bits 30-23), and a 23-bit fraction field F (bits 22-0):

```
 31  30     23  22                    0
+---+----------+-----------------------+
| S |    E     |           F           |
+---+----------+-----------------------+
```

S is the sign bit. As with integer formats, 1 indicates a negative number and 0 indicates a positive one.

F is the *fractional part* of the mantissa.
The non-fractional part is always 1 and is not stored,
so the actual mantissa is
1.F_{2}. This method provides an extra bit of
precision that need not be stored, so we have a 24-bit fixed point
mantissa in reality, even though we only store 23 bits. It also
ensures that there is only 1 way to represent a given value,
unlike the example formats above with integer mantissas.

The 8-bit exponent is stored in bias-127 notation, hence E is an unsigned integer with the value exponent + 127.

The absolute value of the number is 1.F *
2^{E-127}, except for some special cases:

- If E = 255, and F is non-zero, then the value is NaN (Not a Number).
- If E = 255, F = 0 and S = 1, then the value is -infinity.
- If E = 255, F = 0, and S = 0, then the value is +infinity.
- If E = 0, and F = 0, then the value is 0.

The floating-point hardware uses more than 23 bits for computation to avoid round-off error. This is particularly important when adding values with different exponents, since the mantissas must be offset by up to 128 bits before addition, thus producing a sum with up to 24 + 128 = 152 bits. The result is trimmed back to 24 bits after normalizing the result.

What is the largest positive value that can be represented in 32-bit IEEE floating point? Start by taking the largest positive mantissa and the largest positive exponent:

+1.11111111111111111111111_{2} * 2^{+128}

However, any value where E = 255 (exponent = 128) is a special case representing NaN or infinity, so we must take the second largest exponent:

+1.11111111111111111111111_{2} * 2^{+127}

0.11111111111111111111111_{2} is very close to 1.0, since all of the digits are at their maximum. We can therefore approximate it as 0.9999_{10}. So our value is 1.9999_{10} * 2^{+127} = 3.4028 * 10^{38}.

If we want the exact value of 0.11111111111111111111111_{2}, we can note that it is equal to 1.0 minus the least significant bit:

```
  0.11111111111111111111111    2^-1 + ... + 2^-23
+ 0.00000000000000000000001    2^-23
-----------------------------
  1.00000000000000000000000
```

What is the smallest positive value that can be represented in 32-bit IEEE floating point? Start by taking the smallest positive mantissa and the smallest negative exponent:

1.00000000000000000000000_{2} * 2^{-127}

However, E = 0 is a special case in the IEEE format (with F = 0 it represents the value 0), so the smallest normalized value takes the second smallest exponent, E = 1 (exponent = -126):

1.00000000000000000000000_{2} * 2^{-126} = 1.1754 * 10^{-38}

Precision is 24 bits (the 23-bit F plus the assumed leading 1), or around 6-7 decimal digits. The sign bit is separate from the mantissa, so we don't need to deduct it from the precision.

What is the decimal value of the following 32-bit IEEE number?

1 10000001 10100000000000000000000

```
S = 1
E = 10000001_2 = 129, so exponent = 129 - 127 = 2
F = 101_2 (followed by 0s)
M = 1.101_2 = 1.625_10
Value = -1.625 * 2^2 = -6.5_10
```

How do we represent +4.25 in 32-bit IEEE?

```
4.25_10 = 100.01_2 = 1.0001_2 * 2^2
E = 2 + 127 = 129 = 10000001_2
F = 00010000000000000000000
S = 0

0 10000001 00010000000000000000000
```

The 64-bit format is similar to the 32-bit format, but uses an 11-bit bias-1023 exponent, and a 53-bit fixed-point mantissa in the form 1.x, where only the fractional portion (52 bits) is stored.

The range of the 64-bit format is +/-
(2.2250 * 10^{-308} to
1.7976 * 10^{308}).

Define each of the following terms with respect to floating point:

Mantissa

Exponent

Radix

Normalized

Overflow

Underflow

What is the most common radix for floating point number systems? Why?

A 32-bit floating point system uses a 25-bit two's complement mantissa, a 7-bit bias-63 exponent, and a radix of 2.

What is the largest possible positive number?

What is the smallest possible positive number?

Show the value 15.25_10 in this floating point format.

What is the value of the following number in this floating point format?

1111111111111111111111101 1000011

Describe two costs of using floating point in place of integers.

Show the decimal value of each of the following 32-bit IEEE floating point numbers:

0 01111100 10000000000000000000000

1 10000010 11000000000000000000000

Convert the following decimal values to 32-bit IEEE floating point.

-6.125_10

+0.75_10