Floating Point Representation
1. Motivation
Fixed-point representation allocates a fixed number of bits to the integer and fractional parts, limiting both range and precision. Floating-point representation decouples these: it uses a form of scientific notation in binary, allowing a vastly larger range at the cost of variable precision.
2. IEEE 754 Single Precision (32-bit)
Format
An IEEE 754 single-precision number uses 32 bits partitioned as follows:
| 1 bit | 8 bits | 23 bits |
|---|---|---|
| Sign $S$ | Exponent $E$ | Mantissa/Fraction $M$ |
| Field | Bits | Purpose |
|---|---|---|
| Sign ($S$) | 1 | 0 = positive, 1 = negative |
| Exponent ($E$) | 8 | Biased exponent: stored as $E = e + 127$ |
| Mantissa ($M$) | 23 | Fractional part of the significand; implicit leading 1 |
Decoding the Value
The represented value is:

$$v = (-1)^S \times 1.M \times 2^{E - 127}$$

where $1.M$ denotes the binary number $1.m_{22}m_{21}\ldots m_0$. The leading 1 is implicit: it is not stored but always assumed (for normalised numbers). This is called the hidden bit convention.
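As a quick illustration, the decoding rule can be written directly in Python. This is a minimal sketch for normalised numbers only, and the helper name decode_single is ours:

import struct

def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern as an IEEE 754 single (normalised numbers only)."""
    s = (bits >> 31) & 0x1        # sign bit S
    e = (bits >> 23) & 0xFF       # biased exponent E
    m = bits & 0x7FFFFF           # 23-bit mantissa field M
    return (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)  # (-1)^S x 1.M x 2^(E-127)

# Cross-check against the platform's own IEEE 754 decoder:
assert decode_single(0x41280000) == struct.unpack('>f', (0x41280000).to_bytes(4, 'big'))[0]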
Deriving the Range
Normalised numbers have $1 \le E \le 254$ (i.e., $E = 0$ and $E = 255$ are reserved).
- Minimum exponent: $e_{\min} = 1 - 127 = -126$
- Maximum exponent: $e_{\max} = 254 - 127 = 127$
Largest positive normalised number: $(2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$
Smallest positive normalised number: $1.0 \times 2^{-126} \approx 1.2 \times 10^{-38}$
General range (normalised): approximately $1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$ in magnitude.
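These bounds can be checked with Python's struct module, which packs and unpacks raw IEEE 754 single-precision bit patterns:

import struct

# E = 254, M = all ones -> largest normalised single (bit pattern 0x7F7FFFFF)
print(struct.unpack('>f', bytes.fromhex('7f7fffff'))[0])  # 3.4028234663852886e+38

# E = 1, M = 0 -> smallest positive normalised single (bit pattern 0x00800000)
print(struct.unpack('>f', bytes.fromhex('00800000'))[0])  # 1.1754943508222875e-38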
Precision
With 23 explicit mantissa bits plus 1 implicit bit, the significand has 24 bits of precision. This gives approximately $24 \log_{10} 2 \approx 7$ decimal digits of precision.
Special Values
| Exponent | Mantissa | Meaning |
|---|---|---|
| 0 | 0 | $\pm 0$ (signed zero) |
| 0 | $\neq 0$ | Denormalised number |
| 255 | 0 | $\pm\infty$ |
| 255 | $\neq 0$ | NaN (Not a Number) |
Denormalised Numbers
When $E = 0$ and $M \neq 0$, the value is:

$$v = (-1)^S \times 0.M \times 2^{-126}$$

Note: the implicit bit is now 0 (not 1), and the exponent is fixed at $-126$ (not $0 - 127 = -127$).
Why denormalised numbers exist: Without them, there is a gap between zero and the smallest normalised number ($2^{-126} \approx 1.2 \times 10^{-38}$). Denormalised numbers fill this gap, providing a smooth gradual underflow. The smallest positive denormalised number is:

$$0.00\ldots01_2 \times 2^{-126} = 2^{-23} \times 2^{-126} = 2^{-149} \approx 1.4 \times 10^{-45}$$
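The smallest denormalised single can be produced by unpacking the bit pattern 0x00000001 (all bits zero except the last mantissa bit); a short check in Python:

import struct

smallest = struct.unpack('>f', bytes.fromhex('00000001'))[0]
print(smallest)              # 1.401298464324817e-45
print(smallest == 2**-149)   # True: the conversion to Python's double is exact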
Example: Decode IEEE 754 single-precision 0x41280000
Hex: 41280000
Binary: 0 10000010 01010000000000000000000
- $S = 0$ (positive)
- $E = 10000010_2 = 130$, so $e = 130 - 127 = 3$
- $1.M = 1.0101_2 = 1 + 0.25 + 0.0625 = 1.3125$
Value: $(-1)^0 \times 1.3125 \times 2^3 = 10.5$
Example: Encode $-6.5$ in IEEE 754 single precision
- $6.5 = 110.1_2 = 1.101_2 \times 2^2$, so $e = 2$ and $E = 2 + 127 = 129 = 10000001_2$
- $S = 1$ (negative); mantissa bits: $10100000000000000000000$
Result: 1 10000001 10100000000000000000000
1 10000001 10100000000000000000000 = 11000000 11010000 00000000 00000000 = C0 D0 00 00
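Both worked examples can be verified in Python with struct:

import struct

print(struct.pack('>f', -6.5).hex())                      # 'c0d00000'
print(struct.unpack('>f', bytes.fromhex('41280000'))[0])  # 10.5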
3. Normalisation
A floating-point number is normalised when the leading bit of the significand is 1. In binary, this means the number is expressed in the form: $\pm 1.b_1 b_2 b_3 \ldots_2 \times 2^e$
Procedure for normalising:
- Write the number in binary
- Shift the binary point so that exactly one non-zero digit precedes it
- Count the shifts to determine the exponent
- Express in the form $\pm 1.M \times 2^e$
Example: Normalise $0.00101_2$
Shift binary point right 3 positions: $0.00101_2 = 1.01_2 \times 2^{-3}$
The normalised form is $1.01_2 \times 2^{-3}$.
In IEEE 754: $e = -3$, $E = -3 + 127 = 124$.
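Python exposes the normalised form directly through float.hex(), which prints the significand in hexadecimal together with the power of two (Python floats are doubles, hence the long mantissa):

print((0.15625).hex())   # '0x1.4000000000000p-3'  i.e. 1.01_2 x 2^-3
print((10.5).hex())      # '0x1.5000000000000p+3'  i.e. 1.0101_2 x 2^3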
4. Floating-Point Arithmetic and Errors
Why $0.1 + 0.2 \neq 0.3$
Theorem. The decimal fraction $0.1$ has no finite binary representation.
Proof. We show $0.1 = \frac{1}{10}$ requires infinitely many binary fractional digits.
For a number to have a finite representation in base $2$, when reduced to lowest terms $\frac{p}{q}$, the denominator $q$ must divide some power of $2$. Here $q = 10 = 2 \times 5$, and $5$ does not divide any power of $2$. Therefore $0.1$ has no finite binary expansion.
When stored in IEEE 754, $0.1$ is approximated by the nearest representable binary value. Similarly for $0.2$ and $0.3$. Since the approximations introduce rounding errors, the sum of the approximations of $0.1$ and $0.2$ does not exactly equal the approximation of $0.3$.
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False
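The stored approximations can be inspected exactly by converting the floats to Decimal, which prints the full decimal expansion of the binary value actually held:

>>> from decimal import Decimal
>>> Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')
>>> Decimal(0.3)
Decimal('0.299999999999999988897769753748434595763683319091796875')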
Absolute and Relative Error
Given an exact value $x$ and an approximate value $\hat{x}$:

$$\text{absolute error} = |x - \hat{x}|, \qquad \text{relative error} = \frac{|x - \hat{x}|}{|x|}$$

Machine epsilon ($\varepsilon$) is the smallest number such that $1 + \varepsilon \neq 1$ in floating-point arithmetic. For IEEE 754 single precision, $\varepsilon = 2^{-23} \approx 1.19 \times 10^{-7}$.
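Machine epsilon can be found by repeated halving. Python floats are IEEE 754 doubles, so this sketch yields $2^{-52}$; the same loop in a 32-bit float type would stop at $2^{-23}$:

eps = 1.0
while 1.0 + eps / 2 != 1.0:   # keep halving while the sum is still distinguishable from 1
    eps /= 2
print(eps)  # 2.220446049250313e-16, i.e. 2**-52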
Sources of Floating-Point Error
- Representation error: Most decimal fractions have infinite binary expansions
- Rounding error: Arithmetic operations may round the result
- Cancellation error: Subtracting nearly equal numbers loses significant digits
- Accumulation error: Errors compound over many operations
Pitfall: Never use == to compare floating-point numbers. Instead, check whether $|a - b| < \varepsilon$ for some tolerance $\varepsilon$.
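In Python, the standard library already provides a tolerance-based comparison, math.isclose:

import math

x = 0.1 + 0.2
print(x == 0.3)                            # False
print(math.isclose(x, 0.3, rel_tol=1e-9))  # True
print(abs(x - 0.3) < 1e-9)                 # True: the manual equivalent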
5. CIE Simplified 8-Bit Floating Point
Board-specific: CIE (9618) CIE uses a simplified floating-point format:
- 1 sign bit
- 4 exponent bits (excess-8, i.e., bias = 8)
- 3 mantissa bits
Format: S EEEE MMM
Decoding:

$$v = (-1)^S \times 0.M \times 2^{E - 8}$$

Note: CIE uses an explicit leading 0 (not the hidden 1 of IEEE 754).
Range:
- Largest: $0.111_2 \times 2^{15 - 8} = 0.875 \times 2^7 = 112$
- Smallest positive (normalised): $0.100_2 \times 2^{0 - 8} = 2^{-9} \approx 0.00195$
Example: Decode CIE 8-bit 0 1010 110
$S = 0$ (positive); $E = 1010_2 = 10$, so $e = 10 - 8 = 2$; $0.M = 0.110_2 = 0.75$
Value: $0.75 \times 2^2 = 3$
Example: Encode $-5.25$ in CIE 8-bit format
$5.25 = 101.01_2 = 0.10101_2 \times 2^3$, so $E = 3 + 8 = 11 = 1011_2$; mantissa $10101$ truncated to 3 bits gives $101$; $S = 1$
Result: 1 1011 101
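A small decoder for this 8-bit format makes the two examples easy to check. This is a sketch following the field layout above; decode_cie is our own helper name:

def decode_cie(byte: int) -> float:
    """Decode S EEEE MMM: sign bit, excess-8 exponent, explicit 0.M mantissa."""
    s = (byte >> 7) & 0x1
    e = ((byte >> 3) & 0xF) - 8   # remove the excess-8 bias
    m = byte & 0x7                # 3 mantissa bits: 0.MMM = m / 8
    return (-1) ** s * (m / 8) * 2 ** e

print(decode_cie(0b01010110))   # 3.0  (first worked example)
print(decode_cie(0b11011101))   # -5.0 (the truncated encoding of -5.25)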
6. Fixed-Point vs Floating-Point Comparison
| Property | Fixed-Point | Floating-Point |
|---|---|---|
| Range | Limited by bit allocation | Very large ($\approx 10^{\pm 38}$ for single precision) |
| Precision | Uniform (constant) | Variable (depends on magnitude) |
| Speed | Faster (integer arithmetic) | Slower (requires FPU) |
| Hardware | Simple | Complex (FPU required) |
| Representation | Simple to understand | Complex (normalisation, special values) |
| Rounding errors | Predictable | Less predictable near boundaries |
| Use case | Financial, embedded systems | Scientific computing, graphics |
Problem Set
Problem 1. Convert $-14.25$ to IEEE 754 single precision. Give your answer in binary and hexadecimal.
Hint
Convert to binary, normalise, determine sign, exponent (with bias 127), and mantissa.
Answer
$-14.25 = -1110.01_2 = -1.11001_2 \times 2^3$, so $S = 1$, $e = 3$, $E = 3 + 127 = 130 = 10000010_2$
Binary: 1 10000010 11001000000000000000000
Hex: C1640000
Problem 2. Decode the IEEE 754 single-precision value BF800000 (hex).
Hint
Convert hex to binary, split into sign, exponent, mantissa fields.
Answer
BF800000 = 10111111 10000000 00000000 00000000
$S = 1$, $E = 01111111_2 = 127$, so $e = 0$; $M = 0$, so $1.M = 1.0$
Value: $(-1)^1 \times 1.0 \times 2^0 = -1.0$
Problem 3. What is the IEEE 754 single-precision representation of zero? What about negative zero?
Hint
Check the special values table: $E = 0$, $M = 0$.
Answer
$+0$: 0 00000000 00000000000000000000000 = 00000000
$-0$: 1 00000000 00000000000000000000000 = 80000000
They compare equal in IEEE 754, but their sign bits differ.
Problem 4. Encode $9$ in the CIE 8-bit floating-point format.
Hint
Convert to binary, express as $0.M \times 2^e$ with bias 8.
Answer
$9 = 1001_2 = 0.1001_2 \times 2^4$, so $E = 4 + 8 = 12 = 1100_2$; mantissa $1001$ becomes $100$ (truncate)
Result: 0 1100 100
Problem 5. Prove that the decimal number $0.2$ has no finite binary representation.
Hint
Write $0.2$ as a fraction in lowest terms. Apply the same argument used for $0.1$.
Answer
$0.2 = \frac{1}{5}$ in lowest terms. For a finite binary expansion, the denominator must divide a power of 2 when the fraction is in lowest terms. Since $5$ is prime and does not divide any power of $2$, $0.2$ has no finite binary representation.
Problem 6. A system uses 12-bit floating-point: 1 sign bit, 5 exponent bits (excess-15), 6 mantissa bits. Calculate the range and approximate precision.
Hint
Follow the same pattern as IEEE 754 but with different field sizes.
Answer
Assuming hidden bit convention:
- Min exponent: $1 - 15 = -14$
- Max exponent: $30 - 15 = 15$
- Largest: $(2 - 2^{-6}) \times 2^{15} = 65024$
- Smallest normalised: $1.0 \times 2^{-14} \approx 6.1 \times 10^{-5}$
- Precision: $6 + 1 = 7$ bits $\approx 2$ decimal digits
Problem 7. Explain the difference between normalised and denormalised numbers in IEEE 754. Why would removing denormalised numbers be problematic?
Hint
Consider what happens as values approach zero without denormalised numbers.
Answer
Normalised numbers use the implicit leading 1 and the exponent range $1 \le E \le 254$ (i.e., $-126 \le e \le 127$). Denormalised numbers use an implicit leading 0 and the fixed exponent $-126$.
Without denormalised numbers, there would be a sudden jump from the smallest normalised number ($2^{-126} \approx 1.2 \times 10^{-38}$) to zero. Any computation producing a result smaller than this threshold would "flush to zero," losing all precision. Denormalised numbers provide a gradual transition, maintaining relative precision longer.
Problem 8. In IEEE 754 single precision, how many distinct normalised numbers are there? How many denormalised?
Hint
Count the combinations of sign, exponent, and mantissa for each category.
Answer
Normalised: Exponent ranges from 1 to 254 (254 values). Mantissa has $2^{23}$ values. Sign has 2 values. Total: $2 \times 254 \times 2^{23} = 4{,}261{,}412{,}864$.
Denormalised: $E = 0$, $M \neq 0$. Total: $2 \times (2^{23} - 1) = 16{,}777{,}214$.
Problem 9. A programmer computes $a = 10^8$ and $b = a + 1$ in single-precision float, then computes $b - a$. Explain why the result might be $0$ rather than $1$.
Hint
Think about the precision of single-precision float relative to the magnitude of $10^8$.
Answer
Single precision has approximately 7 decimal digits of precision. When $a = 10^8$, the smallest representable difference between consecutive floats near $10^8$ is approximately $8$. Since $1$ is much smaller than the gap between representable numbers near $10^8$, $10^8 + 1$ rounds to $10^8$ itself. Then $b - a = 0$.
This is an example of cancellation error combined with limited precision.
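The effect is easy to reproduce with NumPy's float32 type (assuming NumPy is available; np.spacing reports the gap to the next representable value):

import numpy as np

a = np.float32(1e8)
b = a + np.float32(1.0)   # 1 is far below the float32 gap near 1e8
print(np.spacing(a))      # 8.0: consecutive float32s near 1e8 differ by 8
print(b == a)             # True: 1e8 + 1 rounded back to 1e8
print(b - a)              # 0.0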
Problem 10. Calculate the absolute and relative error when $\pi = 3.14159265\ldots$ is stored as $3.141593$ (6 decimal places).
Hint
Use the formulas for absolute and relative error with $x = \pi$ and $\hat{x} = 3.141593$.
Answer
Absolute error: $|\pi - 3.141593| \approx 3.5 \times 10^{-7}$
Relative error: $\frac{3.5 \times 10^{-7}}{\pi} \approx 1.1 \times 10^{-7}$
7. IEEE 754 Double Precision (64-bit)
Format
| 1 bit | 11 bits | 52 bits |
|---|---|---|
| Sign $S$ | Exponent $E$ | Mantissa $M$ |
| Field | Bits | Purpose |
|---|---|---|
| Sign ($S$) | 1 | 0 = positive, 1 = negative |
| Exponent ($E$) | 11 | Biased exponent: $E = e + 1023$ |
| Mantissa ($M$) | 52 | Fractional part; implicit leading 1 |
Decoding

$$v = (-1)^S \times 1.M \times 2^{E - 1023}$$

Range
- Largest normalised: $(2 - 2^{-52}) \times 2^{1023} \approx 1.8 \times 10^{308}$
- Smallest normalised: $1.0 \times 2^{-1022} \approx 2.2 \times 10^{-308}$
- Precision: $52 + 1 = 53$ bits $\approx 15$ to $16$ decimal digits
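Since Python's float is an IEEE 754 double, these limits are visible in sys.float_info:

>>> import sys
>>> sys.float_info.max
1.7976931348623157e+308
>>> sys.float_info.min
2.2250738585072014e-308
>>> sys.float_info.epsilon
2.220446049250313e-16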
Comparison: Single vs Double
| Property | Single (32-bit) | Double (64-bit) |
|---|---|---|
| Sign | 1 bit | 1 bit |
| Exponent | 8 bits (bias 127) | 11 bits (bias 1023) |
| Mantissa | 23 bits | 52 bits |
| Precision | ~7 decimal digits | ~16 decimal digits |
| Max value | $\approx 3.4 \times 10^{38}$ | $\approx 1.8 \times 10^{308}$ |
| Min normal | $\approx 1.2 \times 10^{-38}$ | $\approx 2.2 \times 10^{-308}$ |
| Machine epsilon | $2^{-23} \approx 1.2 \times 10^{-7}$ | $2^{-52} \approx 2.2 \times 10^{-16}$ |
8. Special Values in Detail
Signed Zero
IEEE 754 has both $+0$ and $-0$:
- $+0$: Sign = 0, Exponent = 0, Mantissa = 0
- $-0$: Sign = 1, Exponent = 0, Mantissa = 0
They compare equal (+0 == -0 is true), but their sign bits differ. This matters for:
- Division: $1 / (+0) = +\infty$, $1 / (-0) = -\infty$
- Square root: $\sqrt{-0} = -0$
- Complex arithmetic where the sign of zero indicates direction
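The sign bit of zero can be observed in Python. Note that Python raises ZeroDivisionError for float division by zero rather than returning infinity, so math.copysign is used here to reveal the sign:

>>> import math
>>> 0.0 == -0.0
True
>>> math.copysign(1.0, -0.0)
-1.0
>>> math.sqrt(-0.0)
-0.0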
Infinity
Represents overflow or division by zero:
- $+\infty$: Sign = 0, Exponent = 255 (all 1s), Mantissa = 0
- $-\infty$: Sign = 1, Exponent = 255, Mantissa = 0
Arithmetic rules:
| Operation | Result |
|---|---|
| $\infty + x$ (for finite $x$) | $\infty$ |
| $\infty \times x$ (for $x > 0$) | $\infty$ |
| $\infty - \infty$ | NaN |
| $\infty / \infty$ | NaN |
| $0 \times \infty$ | NaN |
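These rules can be confirmed in Python with float('inf'):

>>> inf = float('inf')
>>> inf + 1.0
inf
>>> inf - inf
nan
>>> inf / inf
nan
>>> 0.0 * inf
nan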
NaN (Not a Number)
Represents undefined or indeterminate results:
- Sign = 0 or 1 (implementation-dependent)
- Exponent = 255 (all 1s)
- Mantissa $\neq 0$ (any non-zero mantissa)
NaN is produced by:
| Operation | Result |
|---|---|
| $0 / 0$ | NaN |
| $\infty - \infty$ | NaN |
| $0 \times \infty$ | NaN |
| $\sqrt{-1}$ | NaN |
Key property: NaN is not equal to anything, including itself.
>>> float('nan') == float('nan')
False
>>> import math
>>> math.isnan(float('nan'))
True
To check for NaN, use math.isnan(x) — never use x == float('nan').
9. Precision Loss Examples
Example 1: Catastrophic Cancellation
Subtracting two nearly equal numbers loses significant digits.
a = 1.0000001
b = 1.0000000
difference = a - b # Expected: 0.0000001 = 1e-7
print(difference) # Output: about 1.0000000006e-07, not exactly 1e-07
The expected difference is exactly 1e-07. The leading digits of a and b cancel in the subtraction, so the result's relative error is roughly $10^7$ times larger than that of the inputs (the ratio of the operands' magnitude to their difference). Inputs accurate to 16 significant digits leave a difference accurate to only about 9; inputs known to just 8 digits would leave barely 1.
Example 2: Accumulation Error
total = 0.0
for _ in range(1000000):
    total += 0.1
print(total) # Expected: 100000.0
# Output: 100000.00000000134
Each addition introduces a small rounding error. After a million additions, the errors accumulate to a noticeable discrepancy.
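Python's math.fsum tracks the intermediate errors and rounds only once at the end, so the loop's accumulated rounding error disappears:

import math

print(math.fsum(0.1 for _ in range(1000000)))  # 100000.0
print(math.fsum([0.1] * 10) == 1.0)            # True, whereas sum([0.1] * 10) != 1.0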
Example 3: Comparing Floats
x = 0.1 + 0.2
y = 0.3
print(x == y) # False
# Correct approach: compare within tolerance
epsilon = 1e-9
print(abs(x - y) < epsilon) # True
10. Common Pitfalls
| Pitfall | Explanation | Solution |
|---|---|---|
| Using == for float comparison | Rounding errors mean exact equality rarely holds | Compare with tolerance: abs(a - b) < epsilon |
| Assuming floats are exact | Most decimal fractions have infinite binary expansions | Use decimal.Decimal for financial calculations |
| Subtracting nearly equal numbers | Catastrophic cancellation destroys precision | Rearrange the formula algebraically to avoid subtraction |
| Adding small to large numbers | The small number may be lost due to limited precision | Add small numbers together first, then add to the large number |
| Checking x == float('nan') | NaN is not equal to itself by definition | Use math.isnan(x) |
| Ignoring denormalised numbers | Assuming all numbers have 24-bit precision | Denormalised numbers near zero have fewer significant bits |
| Mixing single and double precision | Implicit conversions can lose precision | Be consistent with precision throughout the calculation |
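For the financial-calculation pitfall, the decimal module carries out exact decimal arithmetic when values are constructed from strings:

from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                    # False: binary floats
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))   # True: exact decimal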
11. Additional Problem Set
Problem 1. Encode the value $-0.75$ in IEEE 754 single precision. Give the binary and hexadecimal representation.
Answer
$-0.75 = -0.11_2 = -1.1_2 \times 2^{-1}$, so $S = 1$ (negative), $e = -1$, $E = -1 + 127 = 126 = 01111110_2$
Binary: 1 01111110 10000000000000000000000
Hex: 10111111 01000000 00000000 00000000 = BF400000
Problem 2. Decode the IEEE 754 double-precision value 4039000000000000 (hex).
Answer
Hex: 4039000000000000
Binary: 0100000000111001000000000000000000000000000000000000000000000000
- $S = 0$ (positive)
- $E = 10000000011_2 = 1027$, so $e = 1027 - 1023 = 4$
Value: $(-1)^0 \times 1.1001_2 \times 2^4 = 1.5625 \times 16 = 25$
So the value is $25.0$.
Problem 3. Explain what happens when you compute 1.0 / 0.0 and 0.0 / 0.0 in IEEE 754. Why are the results different?
Answer
1.0 / 0.0 produces $+\infty$. This represents mathematical division where a non-zero quantity is divided by zero: the result grows without bound.
0.0 / 0.0 produces NaN. This represents an indeterminate form: the limit depends on how both numerator and denominator approach zero (e.g., $\lim_{x \to 0} x/x = 1$ but $\lim_{x \to 0} x^2/x = 0$). Since the result is not uniquely determined, IEEE 754 returns NaN.
The distinction is important because $+\infty$ can participate meaningfully in further arithmetic (e.g., $1 / \infty = 0$), while NaN propagates through all operations, signalling that the result is invalid.
Problem 4. A programmer writes the following code to compute the quadratic formula. Explain why it may give incorrect results and suggest a fix.
def quadratic(a, b, c):
    discriminant = b**2 - 4*a*c
    x1 = (-b + discriminant**0.5) / (2*a)
    x2 = (-b - discriminant**0.5) / (2*a)
    return x1, x2
Answer
Problem: Catastrophic cancellation. When $4ac$ is small compared to $b^2$, the discriminant is close to $b^2$. Then $\sqrt{b^2 - 4ac} \approx |b|$, and one of the numerators becomes $-b + |b|$ or $-b - |b|$. If $b > 0$, then $-b + \sqrt{b^2 - 4ac}$ subtracts nearly equal numbers, losing precision.
Fix: Compute one root with the standard formula (choosing the sign that avoids cancellation) and the other using the identity $x_1 x_2 = \frac{c}{a}$:
def quadratic(a, b, c):
    discriminant = b**2 - 4*a*c
    sqrt_d = discriminant**0.5
    if b >= 0:
        x1 = (-b - sqrt_d) / (2*a)   # same signs: no cancellation
    else:
        x1 = (-b + sqrt_d) / (2*a)
    x2 = c / (a * x1)                # from the product of roots: x1 * x2 = c / a
    return x1, x2
By choosing the sign that avoids cancellation in $-b \mp \sqrt{b^2 - 4ac}$, and computing $x_2$ from the product relationship $x_1 x_2 = c/a$, both roots maintain full precision.
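A quick test with illustrative values (our own, not from the problem) shows the difference: with $a = 1$, $b = 10^8$, $c = 1$ the small root is about $-10^{-8}$, and the naive formula loses much of it to cancellation:

a, b, c = 1.0, 1e8, 1.0
naive = (-b + (b * b - 4 * a * c) ** 0.5) / (2 * a)   # cancellation: -b + sqrt(~b^2)
x1, x2 = quadratic(a, b, c)                           # stable version defined above
print(naive)   # about -7.5e-09: roughly 25% error in the small root
print(x2)      # about -1e-08: recovered accurately from c / (a * x1)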
Problem 5. In IEEE 754 single precision, what is the smallest positive number that, when added to , produces a result different from ? Explain why.
Answer
This is the definition of machine epsilon: the smallest $\varepsilon$ such that $1.0 + \varepsilon \neq 1.0$ in floating-point arithmetic.
In single precision, the mantissa has 23 bits. The value $1.0$ is represented as $1.000\ldots0_2 \times 2^0$. The next representable number is $1.000\ldots01_2 \times 2^0$, where the last bit of the mantissa is 1.
This value is $1 + 2^{-23}$.
So $\varepsilon = 2^{-23} \approx 1.19 \times 10^{-7}$.
Any value smaller than $2^{-24}$ (half the gap between $1.0$ and the next representable number), when added to $1.0$, rounds back to $1.0$ because there are not enough mantissa bits to represent the difference. For example, $1.0 + 10^{-8} = 1.0$ in single precision.