Computer Arithmetic - Floating Point
How do computers represent numbers like 3.14159 or 0.000001? The answer is floating-point representation, which allows computers to work with very large and very small numbers using a fixed number of bits.
Floating-Point Representation
IEEE 754 Floating-Point Format
+---------------------------------------------+
| |
| Single Precision (32-bit): |
| +---+---+-----------------------------+ |
| | S | Exponent | Mantissa | |
| |1bit| 8 bits | 23 bits | |
| +---+---+-----------------------------+ |
| |
| Double Precision (64-bit): |
| +---+---+-----------------------------+ |
| | S | Exponent | Mantissa | |
| |1bit| 11 bits | 52 bits | |
| +---+---+-----------------------------+ |
| |
| Value = (-1)^S × 1.Mantissa × 2^(Exp-Bias) |
| Bias = 127 (single), 1023 (double) |
| |
| Example: 6.75 in single precision |
| - Sign: 0 (positive) |
| - Binary: 110.11 |
| - Normalized: 1.1011 × 2^2 |
| - Exponent: 2 + 127 = 129 = 10000001 |
| - Result: 0 10000001 10110000... |
+---------------------------------------------+
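The 6.75 example can be checked by machine. The following is a minimal sketch, assuming Python's standard `struct` module; the helper name `float32_fields` is made up for illustration. It packs a value as a 32-bit float and splits the resulting bit pattern into the three fields shown in the diagram.

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Return (sign, exponent, mantissa) bit fields of x stored as a 32-bit float."""
    # Pack as big-endian single precision, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # top bit
    exponent = (bits >> 23) & 0xFF    # next 8 bits (biased by 127)
    mantissa = bits & 0x7FFFFF        # low 23 bits (implicit leading 1 not stored)
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(6.75)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
```

For 6.75 the fields come out as sign 0, exponent 10000001 (129), and a mantissa that starts with 1011 followed by zeros, matching the worked example.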
Special Values
- Zero: Exponent all 0s, Mantissa all 0s (+0 or -0)
- Infinity: Exponent all 1s, Mantissa all 0s (+∞ or -∞)
- NaN (Not a Number): Exponent all 1s, Mantissa non-zero (0/0, ∞-∞, etc.)
- Denormalized Numbers: Exponent all 0s, Mantissa non-zero (very small numbers)
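To make these bit patterns concrete, here is a small sketch that builds each special value directly from its single-precision encoding. It assumes Python's standard `struct` and `math` modules; the helper `float32_from_bits` and the variable names are illustrative only.

```python
import math
import struct

def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_zero = float32_from_bits(0x00000000)           # exponent 0, mantissa 0
neg_zero = float32_from_bits(0x80000000)           # same, with the sign bit set
pos_inf = float32_from_bits(0x7F800000)            # exponent all 1s, mantissa 0
nan = float32_from_bits(0x7FC00000)                # exponent all 1s, mantissa non-zero
smallest_denormal = float32_from_bits(0x00000001)  # exponent 0, mantissa non-zero

print(pos_zero == neg_zero)           # the two zeros compare equal
print(math.copysign(1.0, neg_zero))   # but the sign of -0 is still observable
print(math.isinf(pos_inf), math.isnan(nan))
print(nan == nan)                     # NaN is not equal to anything, itself included
print(smallest_denormal)
```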
Floating-Point Operations
Floating-Point Addition
+---------------------------------------------+
| |
| Example: 1.5 + 0.75 |
| |
| Step 1: Align exponents |
| 1.5 = 1.1 × 2^0 |
| 0.75 = 1.1 × 2^(-1) |
| 0.75 = 0.11 × 2^0 (shifted) |
| |
| Step 2: Add mantissas |
| 1.1 + 0.11 = 10.01 |
| |
| Step 3: Normalize result |
| 10.01 = 1.001 × 2^1 |
| |
| Step 4: Round if necessary |
| Result: 2.25 ✓ |
+---------------------------------------------+
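The four steps above can be sketched in code. This is a simplified illustration, not how hardware implements addition: it assumes positive, normalized inputs, uses a toy significand with 4 fraction bits, and truncates instead of rounding. The names `fp_add` and `to_float` are hypothetical.

```python
def fp_add(sig_a: int, exp_a: int, sig_b: int, exp_b: int, frac_bits: int = 4):
    """Add two positive values, each given as (significand, exponent).

    A significand is an integer holding 1.xxxx with frac_bits fraction bits,
    so the value represented is sig / 2**frac_bits * 2**exp.
    """
    # Step 1: align exponents by shifting the operand with the smaller exponent right.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b >>= exp_a - exp_b  # bits shifted out are simply dropped here

    # Step 2: add the aligned significands.
    sig = sig_a + sig_b
    exp = exp_a

    # Step 3: normalize so the significand is back in the range [1.0, 2.0).
    limit = 1 << (frac_bits + 1)
    while sig >= limit:
        sig >>= 1
        exp += 1

    # Step 4: real hardware rounds using the dropped bits; this sketch truncates.
    return sig, exp

def to_float(sig: int, exp: int, frac_bits: int = 4) -> float:
    """Convert a (significand, exponent) pair back to a Python float."""
    return sig / (1 << frac_bits) * 2.0 ** exp

# 1.5 = 1.1000 x 2^0 and 0.75 = 1.1000 x 2^(-1); both significands are 0b11000.
result = fp_add(0b11000, 0, 0b11000, -1)
print(result, to_float(*result))
```

With these inputs the sketch yields significand 0b10010 (1.0010) and exponent 1, which is 2.25.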
Precision Issues
Floating-point arithmetic has inherent limitations:
- Rounding Errors: 0.1 + 0.2 ≠ 0.3, because 0.1 and 0.2 have no exact binary representation
- Cancellation: Subtracting nearly equal numbers loses precision
- Order of Operations: (a + b) + c ≠ a + (b + c) due to rounding
- Overflow/Underflow: Numbers too large or too small to represent
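Each of these limitations can be reproduced in a few lines. The sketch below assumes Python floats, which are IEEE 754 double precision on common platforms; the specific constants are chosen only for illustration.

```python
# Rounding error: the sum of the nearest doubles to 0.1 and 0.2 is not the nearest double to 0.3.
rounding_mismatch = (0.1 + 0.2) != 0.3

# Cancellation: subtracting nearly equal numbers exposes the rounding error in the inputs.
diff = (1.0 + 1e-15) - 1.0
relative_error = abs(diff - 1e-15) / 1e-15  # roughly 11%, far above normal rounding error

# Order of operations: addition is not associative.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # a and b cancel first, so c survives
right = a + (b + c)   # c is absorbed into b by rounding, then everything cancels

# Overflow and underflow: results outside the representable range.
overflowed = 1e308 * 10    # too large: becomes infinity
underflowed = 1e-323 / 10  # too small: becomes zero

print(rounding_mismatch, relative_error, left, right, overflowed, underflowed)
```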
The Classic 0.1 + 0.2 Problem
+---------------------------------------------+
| |
| 0.1 in binary: 0.0001100110011... |
| 0.2 in binary: 0.0011001100110... |
| |
| 0.1 + 0.2 = 0.30000000000000004 |
| (not exactly 0.3!) |
| |
| This is why floating-point comparisons |
| should use epsilon tolerance: |
| if (abs(a - b) < epsilon) then equal |
+---------------------------------------------+
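As a hedged illustration of the tolerance-based comparison, the sketch below contrasts exact equality with a fixed absolute epsilon and with Python's standard `math.isclose`, which uses a relative tolerance by default. The helper `nearly_equal` and the tolerance values are assumptions chosen for the example; appropriate tolerances depend on the application.

```python
import math

def nearly_equal(a: float, b: float, epsilon: float = 1e-9) -> bool:
    """Absolute-tolerance comparison, as in the pseudocode above."""
    return abs(a - b) < epsilon

total = 0.1 + 0.2

print(repr(total))                               # shows the extra digits
print(total == 0.3)                              # exact comparison fails
print(nearly_equal(total, 0.3))                  # absolute tolerance succeeds
print(math.isclose(total, 0.3, rel_tol=1e-9))    # relative tolerance succeeds

# A fixed absolute epsilon stops working for large magnitudes,
# where adjacent doubles are already much further apart than epsilon.
big_a = 1e20
big_b = 1e20 + 1e5
print(nearly_equal(big_a, big_b))                # False: the gap dwarfs 1e-9
print(math.isclose(big_a, big_b, rel_tol=1e-9))  # True: the gap is tiny relative to 1e20
```

A relative tolerance scales with the size of the operands, which is why it is often the safer default; an absolute tolerance is still useful when comparing values near zero.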