REPRESENTATION

Floating-point representation is a method used by computers to store real numbers in a standardized, efficient format defined by the IEEE-754 specification. Instead of storing a number exactly as written, floating-point values are encoded using three components: a sign bit, an exponent, and a mantissa (or significand). The number is first normalized into scientific notation of the form 1.xxxxx × 2^e, after which the exponent is adjusted using a bias and the fractional part of the significand is stored without its implicit leading 1. This structure allows a wide range of real numbers to be represented with limited memory, but it also introduces rounding limitations and precision constraints. Understanding floating-point representation is essential for recognizing how numerical values are stored, why certain numbers cannot be represented exactly, and how precision errors arise in computations.

The IEEE 754 standard, established in 1985, defines how floating-point numbers are represented, stored, and calculated in digital computers. It outlines the format for floating-point numbers, as well as rules for rounding, exceptions, and handling edge cases.

Floating-point numbers are represented using three components:

  1. Sign Bit: Indicates whether the number is positive or negative.

  2. Exponent: Determines the scale of the number, stored in a biased format.

  3. Mantissa (Significand): Represents the precision of the number, typically normalized.
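These three fields can be inspected directly on a real machine. The sketch below (Python, standard library only; the function name `decompose` is my own, and the bit widths are those of the 32-bit single-precision format described in the Formats section) reinterprets a float's bytes as an integer and masks out each field:

```python
import struct

def decompose(value):
    # Reinterpret the float's 4 bytes as one 32-bit unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                # bit 31: sign
    exponent = (bits >> 23) & 0xFF   # bits 30..23: biased exponent
    mantissa = bits & 0x7FFFFF       # bits 22..0: fraction (mantissa)
    return sign, exponent, mantissa

print(decompose(10.0))   # (0, 130, 2097152)
```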

FORMATS

SINGLE PRECISION (32-BIT)

Stored in 32 bits: 1 sign bit, 8 exponent bits, and 23 fraction (mantissa) bits.
Format:
 S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF

 * S: 1 bit for the sign (0 for positive, 1 for negative)
 * E: 8 bits for the exponent
 * F: 23 bits for the fractional part (mantissa)
 
 * The sign bit indicates whether the number is positive (0) or negative (1). 
 * The exponent represents the power of 2 by which the mantissa should be multiplied. 
 * The fraction (or mantissa) represents the precision of the floating-point number, 
   and typically, it is normalized, meaning the leading bit is assumed to be 
   1 (in binary), so it's not explicitly stored.
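Reading the fields back gives the stored value via (-1)^S × (1 + F/2^23) × 2^(E-127). A minimal sketch for the normalized case (the helper name `single_value` is hypothetical):

```python
def single_value(sign, exponent, fraction):
    # Normalized case only: implicit leading 1, exponent bias 127
    return (-1.0) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

# Fields of 10.0: sign 0, biased exponent 0b10000010 (130),
# fraction "010" padded to 23 bits, taken as an integer
print(single_value(0, 0b10000010, 0b010 << 20))   # 10.0
```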

DOUBLE PRECISION (64-BIT)

Stored in 64 bits: 1 sign bit, 11 exponent bits, and 52 fraction bits.
Format:
 S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

 * S: 1 bit for the sign (0 for positive, 1 for negative)
 * E: 11 bits for the exponent
 * F: 52 bits for the fractional part (mantissa)
 
 * The sign bit indicates whether the number is positive (0) or negative (1). 
 * The exponent represents the power of 2 by which the mantissa should be multiplied. 
 * The fraction (or mantissa) represents the precision of the floating-point number, 
   and typically, it is normalized, meaning the leading bit is assumed to be 
   1 (in binary), so it's not explicitly stored.
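The same decomposition works for doubles with the 64-bit field widths (1/11/52) and a bias of 1023; a sketch, again using only the standard library:

```python
import struct

def decompose_double(value):
    # Reinterpret the float's 8 bytes as one 64-bit unsigned integer
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 fraction bits
    return sign, exponent, fraction

s, e, f = decompose_double(10.0)
print(s, e - 1023, hex(f))   # 0 3 0x4000000000000
```

Note that the unbiased exponent (3) matches the single-precision case; only the bias and field widths differ.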

EXTENDED PRECISION (80-BIT OR 128-BIT)

Higher precision formats for specific applications.

80-BIT FORMAT

This format is mostly used by the x87 FPU. The 80-bit extended-precision format stores the leading integer bit explicitly, unlike 32-bit and 64-bit formats where it is implicit. The significand field is 64 bits in total: 1 explicit integer bit and 63 fractional bits.

Stored in 80 bits: 1 sign bit, 15 exponent bits, 1 explicit integer bit, and 63 fraction bits.
Format (80-bit):
 S EEEEEEEEEEEEEEE I FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

 * S: 1 bit for the sign (0 for positive, 1 for negative)
 * E: 15 bits for the exponent
 * Integer Bit: 1 explicit leading bit
 * F: 63 bits for the fractional part
   ------------------------------------
   = 80 bits total

 * The sign bit indicates whether the number is positive or negative.
 * The exponent determines the power of 2 by which the significand is scaled. It uses 
   a bias of 16383.
 * The mantissa (significand) is normalized, but unlike other IEEE-754 formats, the 
   leading bit is stored explicitly rather than assumed. This allows the 80-bit format 
   to provide one additional bit of precision.

128-BIT FORMAT

Stored in 128 bits: 1 sign bit, 15 exponent bits, and 112 mantissa (fraction) bits.
Format:
 S EEEEEEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

 * S: 1 bit for the sign (0 for positive, 1 for negative).
 * E: 15 bits for the exponent.
 * F: 112 bits for the fractional part (mantissa).

 * The sign bit indicates whether the number is positive (0) or negative (1).
 * The exponent represents the power of 2 by which the mantissa should be multiplied. 
   The exponent is typically stored with a bias (e.g., 16383 for 15-bit exponent in 
   128-bit format), allowing both positive and negative exponents.
 * The mantissa (or fraction) represents the precision of the floating-point number. 
   The mantissa is normalized, meaning the leading bit (which is always 1 in binary 
   for normalized numbers) is assumed and not stored explicitly, maximizing the use 
   of the 112 bits for precision.

 * Key Features of 128-bit Extended Precision:
    - Precision: Provides extremely high precision with 112 bits for the mantissa, 
      offering up to 34 decimal digits of accuracy.
    - Range: The 15-bit exponent allows for a very wide range of representable values, 
      both extremely large and small, which is ideal for scientific, engineering, and 
      financial computations requiring very high precision.
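The 34-digit figure follows from the significand width: decimal digits ≈ significand bits × log10(2), and binary128 carries 112 stored fraction bits plus the implicit leading 1. A quick arithmetic check:

```python
import math

# 112 stored fraction bits + 1 implicit leading bit = 113 significand bits
digits = 113 * math.log10(2)
print(round(digits, 1))   # 34.0
```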

PROCESS

This is the procedure for how floating-point numbers are stored in memory using the IEEE-754 single-precision (32-bit) format, using the decimal value 10 as an example.

STEP 1: NORMALIZATION

Normalize the number into scientific notation of the form 1.xxxxx × 2^e.

#Convert the decimal number to binary
Decimal: 10
Binary: 1010.0
 * append a binary point after the integer bits
 
#Move the binary point 3 places to the left to get the normalized form
Move 1: 101.0
Move 2: 10.10
Move 3: 1.010
Scientific Notation: 1.010 (base 2) × 2^3
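This normalization can be reproduced with `math.frexp`, which returns m × 2^e with 0.5 ≤ m < 1; shifting the point one place gives the 1.xxxxx form used above:

```python
import math

m, e = math.frexp(10.0)                # 10.0 == 0.625 × 2**4
significand, exponent = m * 2, e - 1   # rescale to the 1.xxxxx × 2**e form
print(significand, exponent)           # 1.25 3  (binary 1.010 × 2^3)
```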

STEP 2: APPLY EXPONENT BIAS

  • For 32-bit single-precision, the exponent bias is 127 (with an 8-bit exponent).

  • For 64-bit double-precision, the exponent bias is 1023 (with an 11-bit exponent).

  • For 80-bit extended precision, the exponent bias is 16383 (with a 15-bit exponent).

#FORMULA: E = e + bias
 = 3 + 127 
 = 130

#Convert the decimal result to binary
Decimal: 130
Binary: 10000010
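The bias step is one addition followed by a binary conversion; in Python:

```python
BIAS = 127                   # single-precision bias (8-bit exponent)
e = 3                        # true exponent from the normalization step
E = e + BIAS                 # biased exponent stored in the field
print(E, format(E, "08b"))   # 130 10000010
```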

STEP 3: DETERMINE THE MANTISSA (SIGNIFICAND)

IEEE-754 does not store the leading 1 for normalized numbers (it is implicit). So store only the fractional part after the binary point.

Normalized: 1.010
Fractional bits: 010
Concatenate: 010 + 00000000000000000000
Significand (23 bits): 01000000000000000000000
 * IEEE single precision requires 23 bits for the mantissa field. 
    - pad the fractional bits "010" with zeros until 23 total bits is reached
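The padding described above is a left-justified fill of the 23-bit field:

```python
fraction_bits = "010"                    # bits after the point in 1.010
mantissa = fraction_bits.ljust(23, "0")  # pad with zeros to the 23-bit field width
print(mantissa, len(mantissa))           # 01000000000000000000000 23
```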

STEP 4: ASSEMBLE

Assemble all three fields into the final bit pattern for storage.

#FORMULA: sign | exponent (8) | mantissa (23)
#identify sign bit: 0 (positive) 1 (negative)
Sign: decimal 10 is a positive number, hence the sign bit is 0
Exponent: 10000010
 * retrieved from step 2
Significand: 010 + 00000000000000000000
 * retrieved from step 3
 
Full 32-bit binary: 0 10000010 01000000000000000000000
IEEE-754 single-precision (32-bit) for 10.0: 01000001001000000000000000000000
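The assembly above can be sketched with shifts and ORs, using the field values derived in the earlier steps:

```python
sign = 0                 # positive, from Step 4
exponent = 0b10000010    # biased exponent 130, from Step 2
fraction = 0b010 << 20   # "010" padded to 23 bits, as an integer (Step 3)

# sign | exponent (8) | mantissa (23)
bits = (sign << 31) | (exponent << 23) | fraction
print(format(bits, "032b"))   # 01000001001000000000000000000000
```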

STEP 5: CONVERT TO HEXADECIMAL

This step is optional and is only for readability

Binary Representation: 01000001001000000000000000000000
Grouping: 0100 0001 0010 0000 0000 0000 0000 0000
           4    1    2    0    0    0    0    0
 * Group into 4-bit nibbles from the left

Hexadecimal representation: 0x41200000
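`struct.pack` confirms the hand assembly: packing 10.0 as a big-endian 32-bit float yields the same four bytes:

```python
import struct

raw = struct.pack(">f", 10.0)     # big-endian 32-bit float
print("0x" + raw.hex().upper())   # 0x41200000
```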
