Floating Point

The Basics

Integer number formats cannot store fractional values. We can get around this using a fixed point format, but this severely limits both the range and the number of fractional digits. A 64-bit fixed point system with 32 whole number digits and 32 fractional digits has a maximum range of 232 - 1 even when we don't need the fractional bits. A system that utilizes all the bits where they are needed would generally be a better option for approximating real numbers.

Floating point gets around the limitations of fixed point by using a format similar to scientific notation.

3.52 * 103 = 3520

A scientific notation number consists of a mantissa (3.52 in the example above) a radix (always 10), and an exponent (3 in the example above). Hence, the general format of a scientific notation value is:

mantissa * radixexponent

The normalized form always has a mantissa greater than or equal to 1.0, and strictly less than 10.0. Written mathematically, the mantissa is in the set [1.0,10.0). As you may recall from algebra, [] encloses a set including the extreme values and () encloses a set excluding them. We can denormalize the value and express it in many other ways, such as 35.2 * 102, or 0.00325 x 106. For each position we shift the digits of the mantissa relative to the decimal point, we are multiplying or dividing the mantissa by a factor of 10. To compensate for this, we simply increase or decrease the exponent. Adding 1 to the exponent multiplies 10exponent by 10, for example. this compensates for shifting the mantissa digits one position to the right, which divides the mantissa by 10.

Denormalizing is necessary when adding scientific notation values with different exponents:

    3.52 * 10^3
+   1.97 * 10^5

    0.0352 * 10^5
+   1.97 * 10^5
    2.0052 * 10^5

Adjusting the mantissa and exponent is also sometimes necessary to normalize results. For example, if we had chosen to denormalize the other value above, we would see the following:

    3.52 * 10^3
+   1.97 * 10^5

      3.52 * 10^3
+   197.0 * 10^3 
    200.52 * 10^3   Not normalized
    2.0052 * 10^5   Normalized

For this reason, it's best to denormalize the number with the smaller exponent. This way, the result of the addition is likely to be already normalized.

A binary floating system is basically the same, but stores a signed binary mantissa and a signed binary exponent, and usually uses a radix of 2. Using a radix of 2 (or any power of 2) allows us to normalize and denormalize by simply shifting the binary digits in the mantissa and adding or subtracting from the exponent, just as we do with scientific notation. Shifting binary digits in the mantissa n bits to the left or right multiplies or divides the mantissa by 2n, just as shifting decimal digits multiplies or divides by 10.

000102 * 23 = 010002 * 21.

Standard floating point formats are defined by the IEEE society. The IEEE formats are slightly more complex than necessary to understand floating point in general, so we will start with a simpler example here.

Floating Point and Abstraction

Many people get confused about floating point because of their inability to abstract it into a hierarchy of simple components. If we solve one piece of the puzzle at a time and then put them together, it's no more difficult than fixed point or integers. It just requires a simple extra step such as computing mantissa * radix ^ exponent. Don't think about the floating point format overall or the exponent while converting the mantissa. Just understand the mantissa and deal with it separately. This is a classic example of divide and conquer, which is the whole point of abstraction.

A Simple Floating Point Format

Suppose a 32-bit floating point format has a 24-bit two's complement mantissa, an 8-bit two's complement exponent, and a radix of 2. The general structure is:

mantissa * 2exponent

The binary format is as follows, labeling bits starting with 0 on the right as we would with an unsigned integer:

 31                  8 7        0
|       mantissa      | exponent |

We do not need to store the radix, since it never changes and can just be assumed by the hardware design.

  1. What is the value of the following number?

    000000000000000000010010 11111100

    The mantissa is 000000000000000000010010, or +(2 + 16) = +1810.

    The exponent is 11111100 = -(00000011 + 1) = -00000100 = -410.

    The value is therefore +18 * 2-4, or +1.125_10.

  2. What is the largest positive value we can represent in this system?

    The largest positive value will consist of the largest positive mantissa and the largest positive exponent. Since we know the range of any two's complement system from the section called “Two's Complement”, this is easy to figure out.

    The largest positive mantissa is 011111111111111111111111, which in two's complement is +223-1 = +8388607. The largest exponent is 01111111, which in two's complement is +27-1 (+127).

    Hence, the largest positive value is +8388607 * 2+127 = 1.42 * 1045.

    Note that this is a much larger number than we can store in 1 32-bit integer. But with 32 bits, there are exactly 232 different patterns of 0s and 1s possible, so we cannot represent more than 232 different numbers.

    The obvious catch, then, is that we cannot store every integer value up to 1.42 * 1045. So what is the second largest positive value? What is the difference between the largest and second largest? This would be the second largest mantissa and the largest exponent, so +8388606 * 2+127. This value is equal to the largest value minus 2127. As we can see, the larger the exponent, the more spread out are the consecutive values that our system can represent.

  3. What is the smallest positive value? For an integer system, this is always 1. To find the smallest positive value in the form mantissa * radixexponent, we choose the smallest positive mantissa, and the smallest negative exponent (the negative exponent with the largest magnitude).

    Since the mantissa is an integer, the smallest positive value possible is 1.

    Since the exponent is an 8-bit two's complement value, the smallest negative exponent is 100000002, of -27 = -128.

    Hence the smallest positive value is 1 x 2-128, or 2.93873587706 x 10-39. The next smallest would be this value + 10-39. So we can see that consecutive values close to 0 are very close together.

  4. What is the precision of the system?

    The precision of any system is the maximum number of significant figures, and is determined by the mantissa since the exponent only adds leading or trailing 0s to the value (i.e. shifts the digits in the mantissa left or right). The precision is therefore 23 bits, which is the number of bits that contribute to the magnitude of a two's complement integer.


    A 32-bit integer is significantly more precise than a 32-bit floating point value, since floating point sacrifices some bits of precision to increase range and allow for fractional values.
  5. Represent -2.75 in this floating point system.

    1. Convert the number to fixed point binary using the methods described in previous sections:

      -2.7510 = -(10.112)

    2. Multiply by radixexponent equal to 1 so we have all the components:

      -2.7510 = -(10.112) * 20

    3. Shift the binary point to make the mantissa a whole number: -(10112)

      By moving the binary point two places to the right, we multiplied the mantissa by 22. We therefore must divide the other factor (2exponent) by 22 in order to preserve the overall value. This is is just a matter of subtracting 2 from the exponent.

      -10.112 * 20 = -(10112) * 2-2

    4. Convert the mantissa and exponent into the specified formats. The mantissa in this case is 24-bit two's complement and the exponent is 8-bit two's complement.

      Mantissa: -(000000000000000000001011) = 111111111111111111110101

      Exponent: -210 = 11111110

    5. Place the mantissa and exponent in their positions in the 32-bit floating point format:

      Binary representation = 11111111111111111111010111111110

Overflow and Underflow

Overflow occurs when the result of a floating point operation is larger than the largest positive value, or smaller than the smallest negative value. In other words, the magnitude is too large to represent.

Underflow occurs when the result of a floating point operation is between the smallest possible positive value and the largest possible negative value. In other words, the magnitude is too small to represent.

The example 32-bit format above cannot represent values larger than 1.42 * 1045 or smaller than 2.93873587706 * 10-39.

One technique to avoid overflow and underflow is to alternate operations that increase and decrease intermediate results. Rather than do all the multiplications (by values greater than 1) first, which could cause overflow, or all the divisions first, which could cause underflow, we could alternate multiplications and divisions to moderate the results along the way. Techniques like these must often be used in scientific calculations.

Using 4-bit unsigned integers as a simpler example of the concept, if we try to compute 4 * 4 / 2, we will encounter an overflow because 4 * 4 is larger than the maximum value of 15 that we can represent in 4-bits. If, on the other hand, we compute it as 4 / 2 * 4, we get the correct result. The same principle applies to all systems with limited range.

Cost of Floating Point

Everything has a cost. The increased range and the ability to represent non-whole numbers is no exception.


There are only 232 possible patterns of 32 0s and 1s. Hence, there are at most 232 unique numbers that we can represent with 32 bits, regardless of the format.

So how is it that we can represent numbers up to 1045?

Obviously, we must be sacrificing something in between. What floating point does for us is spread out the limited number of binary patterns we have available to cover a larger range of numbers. The larger the exponent, the larger the gap between consecutive numbers that we can accurately represent.

The gap between consecutive numbers that we can represent gets larger as the exponent grows.

The precision of a 32-bit floating point value is less than the precision of a 32-bit integer. By using 8 bits for the exponent, we sacrifice those 8 bits of precision. Hence, our example format has the same precision as a 24-bit signed integer system.


Arithmetic on floating point is several times slower than on integers. This is an inherent property of the more complex format.

Consider the process of adding two scientific notation values:

  1. Equalize the exponents
  2. Add the mantissas
  3. Normalize the result

Each of these operations take roughly the same amount of time in a computer as a single integer addition. Since floating point is stored like scientific notation, it uses a similar process and we can expect floating point addition to take about three times as long as integer addition. In reality, a typical PC takes about 2.5 times as long to run a loop doing mostly floating point arithmetic as it does to do the same loop with integers.

Note that this applies only to operations that can be carried out using either a single integer instruction or a single floating point instruction. For example, suppose a program is running on a 32-bit computer, and the data may be outside the range of a 32-bit integer. In this case, multiple integer instructions will be necessary to process integer values of more than 64 bits, and much of the speed advantage of integers is lost.

If hardware has floating point support built-in, then common operations like floating point addition, subtraction, etc. can each be handled by a single instruction. If hardware doesn't have a floating point unit (common in microcontrollers), floating point operations must be handled by software routines. Hence, adding two floating point values will require dozens of integer instructions to complete instead of just one floating point instruction.

Most algorithms can be implemented using integers instead of floating point with a little thought. Use of floating point is often the result of sheer laziness. Don't use floating point just because it's intuitive. Think about whether it is really necessary. The answer is usually no.

Example 14.8. Improving Speed with Integers

The value of PI can be estimated using a Monte Carlo simulation by generating uniform random coordinates across a square "dart board" with an inscribed circle. Since the darts are distributed in a uniform random fashion, the probability of one landing in the circle is the area of the circle divided by the area of the square. This ratio is PI/4.

Figure 14.1. Monte Carlo Dart Board

Monte Carlo Dart Board

After generating many random points, we simply divide the number within the circle by the total to get an estimate of PI/4. The intuitive approach is a use a 1 x 1 square and generate random x and y values between 0 and 1. However, this approach requires the use of floating point. if we instead make the dimensions of the square our maximum integer size (e.g. 232-1) and generate integer x and y values instead, we can eliminate all floating point calculations from the program.

In fact, the integer implementation of this program is more than twice as fast as the floating point implementation. Full details can be found in the Research Computing User's Guide.

Power Consumption

Use of floating point also results in more power consumption. This is not usually noticeable on a desktop PC, but can become a problem on large grids consisting of hundreds of PCs, since the power grid they are attached to may not be designed to provide for their maximum draw. It can also be a problem when running a mobile device on battery while doing intensive computations. Battery life while doing intensive floating point computations could be a small fraction of what it is while reading email, browsing the web, or editing a document.

Difficult to Program

Because floating point suffers from round-off error, it is not safe to directly compare floating point values in a program. For example, the following program will fail due to the fact that 0.1 cannot be represented perfectly with a limited number of digits. Instead of stopping at 1.0, x misses the value by a tiny amount, and an infinite loop occurs.

#include <stdio.h>
#include <sysexits.h>

int     main(int argc,char *argv[])

    double  x;

    for (x = 0.0; x != 1.0; x += 0.1)
        printf("%0.16f\n", x);

    return EX_OK;

Actual output:

IEEE Floating Point Formats

The example floating point systems above are only for general understanding and context and not used in real systems. Use of integer mantissa formats suffers from problems such as multiple representations of the same value, e.g. 1 * 22 = 2 * 21.

The IEEE floating point formats have become standard on most CPUs. They are highly optimized and contain special features, like representations for infinity and not-a-number (NaN).

Table 14.12. IEEE 32-bit Format


S is the sign bit. Like integer formats, 1 indicates a negative number and 0 indicates positive.

F is the fractional part of the mantissa. The non-fractional part is not stored, and always 1, so the actual mantissa is 1.F2. This method provides an extra bit of precision that need not be stored, so we have a 24-bit fixed point mantissa in reality, even though we only store 23 bits. It also ensures that there is only 1 way to represent a given value, unlike the example formats above with integer mantissas.

The 8-bit exponent is stored in bias-127 notation, hence E is an unsigned integer with the value exponent + 127.

The absolute value of the number is 1.F x 2E-127, except for some special cases:

  • If E = 255, and F is non-zero, then the value is NaN. (Not a Number)
  • If E = 255, F = 0 and S = 1, then the value is -infinity.
  • If E = 255, F = 0, and S = 0, then the value is +infinity.
  • If E = 0, and F = 0, then the value is 0.

The floating-point hardware uses more than 23 bits for computation to avoid round-off error. This is particularly important when adding values with different exponents, since the mantissas must be offset by up to 128 bits before addition, thus producing a sum with up to 24 + 128 = 152 bits. The result is trimmed back to 24 bits after normalizing the result.

  1. What is the largest positive value that can be represented in 32-bit IEEE floating point? Start by taking the largest positive mantissa and the largest positive exponent:

    +1.111111111111111111111112 * 2+128

    However, any value where E=255 (exponent = 128) is a special case representing NaN or infinity, so we must take the second largest exponent:

    +1.111111111111111111111112 * 2+127

    0.111111111111111111111112 is very close to 1.0, since all of the digits are at their maximum. We can therefore approximate it 0.999910. So our value is 1.999910 * 2+127 = 3.4028 * 1038.

    If we want the exact value of 0.11111111111111111111111, we can note that it is equal to 1.0 - the least significant bit:

                        0.11111111111111111111111   2^-1 + ... + 2^-23
                    +   0.00000000000000000000001   2^-23
  2. What is the smallest positive value that can be represented in 32-bit IEEE floating point? Start by taking the smallest positive mantissa and the smallest negative exponent:

    1.000000000000000000000002 * 2-127

    However, since this value is a special case in IEEE format used to represent 0, we take the second smallest mantissa:

    1.000000000000000000000012 * 2-127 = 1.1754 * 10-38

  3. Precision is 24 bits (23-bit F + the assumed 1), or around 6-7 decimal digits. The sign bit is separate from the mantissa, so we don't need to deduct it from the precision.

  4. What is the decimal value of the following 32-bit IEEE number?

                    1 10000001 10100000000000000000000
                    S = 1
                    E = 10000001_2 = 129 - 127 = 2
                    F = 101_2
                    M = 1.101_2 = 1.625_10
                    Value = -1.625 * 2^2
  5. How do we represent +4.25 in 32-bit IEEE?

                    4.25_10 = 100.01_2
                            = 1.0001_2 * 2^2
                    E = 2 + 127 = 129 = 10000001_2
                    F = 0011
                    S = 0
                    0 10000001 00010000000000000000000

The 64-bit format is similar to the 32-bit format, but uses an 11-bit bias-1023 exponent, and a 53-bit fixed-point mantissa in the form 1.x, where only the fractional portion (52 bits) is stored.

Table 14.13. IEEE 64-bit Format


The range of the 64-bit format is +/- (2.2250 * 10-308 to 1.7976 * 10308)



Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions” before doing the practice problems below.
  1. Define each of the following terms with respect to floating point:

    1. Mantissa

    2. Exponent

    3. Radix

    4. Normalized

    5. Overflow

    6. Underflow

  2. What is the most common radix for floating point number systems? Why?

  3. A 32-bit floating point system uses a 25-bit two's complement mantissa and a 7-bit bias-63 exponent, and a radix of 2.

    1. What is the largest possible positive number?

    2. What is the smallest possible positive number?

    3. Show the value 15.25_10 in this floating point format.

    4. What is the value of the following number in this floating point format?

      1111111111111111111111101 1000011

  4. Describe two costs of using floating point in place of integers.

  5. Show the decimal value of each of the following 32-bit IEEE floating point numbers:

    1. 0 01111100 10000000000000000000000

    2. 1 10000010 11000000000000000000000

  6. Convert the following decimal values to 32-bit IEEE floating point.

    1. -6.125_10

    2. +0.75_10