Computer Arithmetic - Floating Point

How computers represent and compute with real numbers

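How do computers represent numbers like 3.14159 or 0.000001? The answer is floating-point representation, a binary form of scientific notation that stores a sign, an exponent, and a mantissa in a fixed number of bits. Because the exponent scales the mantissa, the same format covers both very large and very small magnitudes, at the cost of storing most values only approximately.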

Floating-Point Representation


    IEEE 754 Floating-Point Format
    +---------------------------------------------+
    |                                             |
    |  Single Precision (32-bit):                 |
    |  +-------+----------+------------------+    |
    |  | S     | Exponent | Mantissa         |    |
    |  | 1 bit | 8 bits   | 23 bits          |    |
    |  +-------+----------+------------------+    |
    |                                             |
    |  Double Precision (64-bit):                 |
    |  +-------+----------+------------------+    |
    |  | S     | Exponent | Mantissa         |    |
    |  | 1 bit | 11 bits  | 52 bits          |    |
    |  +-------+----------+------------------+    |
    |                                             |
    |  Value = (-1)^S × 1.Mantissa × 2^(Exp-Bias) |
    |  Bias = 127 (single) or 1023 (double)       |
    |                                             |
    |  Example: 6.75 in single precision          |
    |  - Sign: 0 (positive)                       |
    |  - Binary: 110.11                           |
    |  - Normalized: 1.1011 × 2^2                 |
    |  - Exponent: 2 + 127 = 129 = 10000001       |
    |  - Result: 0 10000001 10110000...           |
    +---------------------------------------------+
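
The 6.75 example can be checked in a few lines of Python. This is a minimal sketch that uses the standard `struct` module to reinterpret a single-precision value as a 32-bit integer; the helper name `float32_fields` is hypothetical, chosen only for this illustration.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Split x, rounded to single precision, into (sign, exponent, mantissa)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                   # 1 sign bit
    exponent = (bits >> 23) & 0xFF      # 8 biased exponent bits
    mantissa = bits & 0x7FFFFF          # 23 stored mantissa bits
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(6.75)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# prints: 0 10000001 10110000000000000000000

# Rebuild the value from the fields (valid for normalized numbers only).
value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
print(value)  # prints: 6.75
```

The printed fields match the sign, exponent, and mantissa worked out by hand in the diagram.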

Special Values

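Zero: Exponent all 0s, Mantissa all 0s (+0 or -0, depending on the sign bit)
Infinity: Exponent all 1s, Mantissa all 0s (+∞ or -∞, depending on the sign bit)
NaN (Not a Number): Exponent all 1s, Mantissa non-zero (the result of 0/0, ∞-∞, etc.)
Denormalized Numbers: Exponent all 0s, Mantissa non-zero (values smaller in magnitude than the smallest normalized number, stored without the implicit leading 1)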
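
The rules above translate directly into a bit-pattern classifier. The following is a sketch for single precision only; `classify_float32` and `bits_of` are hypothetical helper names introduced for this example.

```python
import math
import struct

def classify_float32(bits: int) -> str:
    """Name the category of a 32-bit pattern using the exponent/mantissa rules."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:                   # exponent all 0s
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:                # exponent all 1s
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

def bits_of(x: float) -> int:
    """Return the 32-bit pattern of x after rounding to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

# 1e-45 is below the smallest normalized single, so it is stored as a denormal.
for x in (0.0, -0.0, math.inf, math.nan, 1e-45, 6.75):
    print(x, classify_float32(bits_of(x)))
```

The sign bit is deliberately ignored here, since it only distinguishes +0 from -0 and +∞ from -∞ and does not change the category.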

Floating-Point Operations


    Floating-Point Addition
    +---------------------------------------------+
    |                                             |
    |  Example: 1.5 + 0.75 (mantissas in binary)  |
    |                                             |
    |  Step 1: Align exponents                    |
    |  1.5  = 1.1 × 2^0                           |
    |  0.75 = 1.1 × 2^(-1)                        |
    |  0.75 = 0.11 × 2^0  (shifted)               |
    |                                             |
    |  Step 2: Add mantissas                      |
    |  1.1 + 0.11 = 10.01                         |
    |                                             |
    |  Step 3: Normalize result                   |
    |  10.01 = 1.001 × 2^1                        |
    |                                             |
    |  Step 4: Round if necessary                 |
    |  Result: 2.25 (exact, no rounding) ✓        |
    +---------------------------------------------+
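
The four steps can be sketched with integer mantissas. This is a toy model, not how hardware is built: it handles positive operands only, treats each number as `m * 2**e` with an integer `m`, and aligns by shifting left with Python's unbounded integers, whereas real adders shift the smaller operand right and keep a few guard bits. The names `fp_add` and `PRECISION` are assumptions made for this illustration.

```python
PRECISION = 24  # mantissa bits kept, as in single precision (1 implicit + 23 stored)

def fp_add(m1: int, e1: int, m2: int, e2: int) -> tuple[int, int]:
    """Add two positive toy floats, each meaning m * 2**e with an integer m."""
    # Step 1: align exponents by rewriting both operands at the smaller exponent.
    e = min(e1, e2)
    m1 <<= e1 - e
    m2 <<= e2 - e
    # Step 2: add the aligned mantissas.
    m = m1 + m2
    # Steps 3 and 4: normalize to PRECISION bits, rounding half to even.
    extra = m.bit_length() - PRECISION
    if extra > 0:
        dropped = m & ((1 << extra) - 1)    # bits that do not fit
        half = 1 << (extra - 1)
        m >>= extra
        e += extra
        if dropped > half or (dropped == half and m & 1):
            m += 1
            if m.bit_length() > PRECISION:  # rounding carried into a new bit
                m >>= 1
                e += 1
    return m, e

# 1.5 = 11 (binary) * 2^-1 and 0.75 = 11 (binary) * 2^-2
m, e = fp_add(0b11, -1, 0b11, -2)
print(bin(m), e, m * 2.0 ** e)  # prints: 0b1001 -2 2.25
```

The mantissa `1001` with exponent -2 is the same bit string as 10.01 in Step 2 of the diagram. Rounding only comes into play when the sum needs more than `PRECISION` bits, for example when adding 1 to 2^24.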

Precision Issues

Floating-point arithmetic has inherent limitations:

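Rounding Errors: 0.1 + 0.2 ≠ 0.3, because none of these values has an exact binary representation
Cancellation: Subtracting nearly equal numbers cancels the leading digits, so the result is dominated by earlier rounding error
Order of Operations: (a + b) + c can differ from a + (b + c) because each intermediate result is rounded
Overflow/Underflow: Results too large in magnitude become ±∞; results too close to zero lose precision or become 0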
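
Each of these limitations can be reproduced in a few lines. The sketch below assumes Python's `float`, which is an IEEE 754 double on all common platforms; the specific constants are chosen only to make the effects visible.

```python
import sys

# Rounding error: the operands and the sum are all rounded to nearby doubles.
print(0.1 + 0.2 == 0.3)                 # prints: False

# Cancellation: 1 + 1e-15 is already rounded, and the subtraction exposes the error.
tiny = 1e-15
recovered = (1.0 + tiny) - 1.0
print(abs(recovered - tiny) / tiny)     # relative error of roughly 11%

# Order of operations: adjacent doubles near 1e16 are 2 apart,
# so adding 1.0 twice is lost while adding 2.0 once is not.
a, b, c = 1e16, 1.0, 1.0
print((a + b) + c == a + (b + c))       # prints: False

# Overflow and underflow in double precision.
print(sys.float_info.max * 2)           # prints: inf
print(sys.float_info.min / 2**54)       # prints: 0.0
```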


    The Classic 0.1 + 0.2 Problem
    +---------------------------------------------+
    |                                             |
    |  0.1 in binary: 0.0001100110011...          |
    |  0.2 in binary: 0.0011001100110...          |
    |                                             |
    |  0.1 + 0.2 = 0.30000000000000004            |
    |     (double precision; not exactly 0.3!)    |
    |                                             |
    |  This is why floating-point comparisons     |
    |  should use a tolerance, not ==:            |
    |  if (abs(a - b) < epsilon) then equal       |
    |  Scale epsilon to the size of a and b.      |
    +---------------------------------------------+
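
A tolerance-based comparison might look like the sketch below. It uses `math.isclose` from the Python standard library; the tolerance values are illustrative assumptions, and the right choice depends on the application.

```python
import math

total = 0.1 + 0.2
print(total == 0.3)                                 # prints: False
print(abs(total - 0.3) < 1e-9)                      # prints: True
print(math.isclose(total, 0.3, rel_tol=1e-9))       # prints: True

# A fixed absolute epsilon stops working for large values,
# where adjacent doubles are already 2 apart:
big = 1e16
print(abs((big + 2.0) - big) < 1e-9)                # prints: False
print(math.isclose(big + 2.0, big, rel_tol=1e-9))   # prints: True
```

A relative tolerance scales with the operands, which is why it still accepts the two neighbouring values near 1e16. When comparing against zero, `math.isclose` needs an explicit `abs_tol`, because any relative tolerance of zero is zero.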
