Basics: floating point numbers

MMIX can process not only integers but also floating point numbers. It uses the well-known IEEE (Institute of Electrical and Electronics Engineers) Standard 754 for this purpose.

Note that the same floating point standard is used in all Intel FP co-processors, so you can confidently use any of the numerous resources on this topic.

Storing floating point numbers rests on a few basic ideas:

every number is represented by two parts called the fraction (or mantissa) and the exponent (or order); the first stores the significant digits of the number and the second gives the position of the point
among the many possible representations there is one dedicated form called normal; numbers are stored in normal form whenever possible; as shown later, the first binary digit in normal form is always 1, so the computer need not store this bit in memory (hidden bit)
every decimal number (if it's not too large) can be converted into a binary floating point number; because this conversion is generally approximate, two nearby decimal numbers may produce exactly the same binary value (always keep this in mind when comparing floating point numbers!)
the situation when a number is too large to be represented with the available binary digits is called overflow; the opposite case, when a number is so close to 0 that the available binary exponent cannot distinguish it from 0, is called underflow; as you might guess, the latter case is less tragic
not every binary code corresponds to an ordinary floating point number: some special combinations denote infinity, and others, called NaN ("Not-a-Number"), denote undefined and other special results
several floating point formats exist, differing in the number of bits in the mantissa and the order
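
The warning about comparisons can be tried out on any machine that uses IEEE 754. The following Python sketch is only an illustration, not MMIX code: it assumes Python floats are 64-bit IEEE 754 values (true on common platforms), and the helper name `nearly_equal` and its tolerance are hypothetical choices.

```python
import math

def nearly_equal(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Compare two floats with a relative tolerance instead of ==."""
    return math.isclose(a, b, rel_tol=rel_tol)

# Two different decimal literals round to the same 64-bit value.
same_value = (0.1 == 0.1000000000000000055)

# Rounding errors make an "obvious" equality fail.
exact_sum = (0.1 + 0.2 == 0.3)

print(same_value)                    # True
print(exact_sum)                     # False
print(nearly_equal(0.1 + 0.2, 0.3))  # True
```

A tolerance-based test like this is one common way to compare results of floating point calculations; the right tolerance depends on the problem.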

Let's discuss some details. Exponent notation is widely used in many branches of science to write very large or very small numbers. For example, the mass of the electron is 9.11*10⁻³¹ kg, and the Avogadro constant, the number of particles in one mole of any chemical substance, is 6.02*10²³ mol⁻¹. This way of writing numbers has a special name: scientific notation. Evidently, every number has many representations in this notation, for example:
6.02*10²³ = 60.2*10²² = 0.602*10²⁴ = ...
To single out one form, the following condition on the mantissa M is used:
1/R <= M < 1, where R is the radix (10 for people and 2 for computers)
Numbers that satisfy this rule are called normalized. The only normalized form in the example above is 0.602*10²⁴.
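
For R = 2 the same normalization rule can be observed with Python's standard `math.frexp`, which returns a mantissa M with 1/2 <= |M| < 1 and an integer order. This is a small sketch for illustration only; the wrapper name `normalize` is hypothetical.

```python
import math

def normalize(x: float):
    """Return (M, order) with x == M * 2**order and 1/2 <= |M| < 1."""
    return math.frexp(x)

print(normalize(1.0))    # (0.5, 1)
print(normalize(10.0))   # (0.625, 4)
print(normalize(0.0))    # (0.0, 0): zero has no normalized form

m, order = normalize(6.02e23)
print(0.5 <= m < 1.0)                     # True
print(math.ldexp(m, order) == 6.02e23)    # True: the pair restores x exactly
```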

It's curious that 0.0 can't be normalized! Note also that +0.0 and -0.0 have different binary codes; this is one of the reasons to use a special compare instruction for float data.
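
The two zeros can be inspected with a short Python sketch (an illustration outside MMIX, assuming 64-bit IEEE 754 floats; the helper name `bits64` is hypothetical). It shows why comparing the raw codes as integers would give the wrong answer.

```python
import struct

def bits64(x: float) -> str:
    """Return the 64-bit IEEE 754 code of x as 16 hex digits."""
    return struct.pack(">d", x).hex().upper()

print(bits64(+0.0))   # 0000000000000000
print(bits64(-0.0))   # 8000000000000000
print(0.0 == -0.0)    # True: numerically equal despite different codes
```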

Normalization aims to keep as many significant digits as possible in a fixed number of bits. It's very important that for R = 2 (binary numbers) M >= 1/2, so the first digit of M must be 1! This regularity lets the computer leave the leftmost bit out of memory and use the freed position for one more bit of mantissa, which increases precision. This method is called the hidden bit. By the way, it is precisely the number of bits in the mantissa that determines the precision of calculations.

The number of bits in the exponent determines the numeric range of the computer. For instance, with an 11-bit order the magnitudes of normalized numbers lie approximately between 10⁻³⁰⁸ and 10³⁰⁸. "Denormal numbers" extend this range toward zero (down to about 10⁻³²⁴), but with fewer bits of precision; the upper limit stays the same.
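
These limits can be checked with a brief Python sketch, assuming Python floats use the same 64-bit IEEE 754 format (the variable names are illustrative only):

```python
import sys

largest = sys.float_info.max           # about 1.8e308
smallest_normal = sys.float_info.min   # about 2.2e-308
smallest_denormal = 5e-324             # least positive denormal number

print(largest, smallest_normal, smallest_denormal)
print(0.0 < smallest_denormal < smallest_normal)  # True
print(smallest_denormal / 2 == 0.0)               # True: underflow to zero
print(largest * 2 == float("inf"))                # True: overflow to infinity
```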

The IEEE standard also defines some special codes with all exponent bits set to 1. With a zero mantissa they denote positive and negative infinity; with a nonzero mantissa they are the so-called NaNs, which include the undetermined value and some other specific values. We won't discuss this material in detail; just note once more that not every binary code is an ordinary floating point number (see also the table with examples below).
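
As a rough illustration, the special codes can be told apart by looking at the exponent and mantissa fields of a 64-bit code. The Python sketch below assumes the 64-bit layout (1 sign bit, 11 exponent bits, 52 mantissa bits); the function name `classify` is hypothetical, and the quiet/signaling distinction follows the usual convention that the top mantissa bit marks a quiet NaN.

```python
def classify(code: int) -> str:
    """Classify a 64-bit IEEE 754 code given as an integer."""
    exponent = (code >> 52) & 0x7FF
    fraction = code & ((1 << 52) - 1)
    if exponent == 0x7FF:
        if fraction == 0:
            return "infinity"
        # Top fraction bit set means quiet NaN, clear means signaling NaN.
        return "quiet NaN" if fraction >> 51 else "signaling NaN"
    if exponent == 0:
        return "zero" if fraction == 0 else "denormal"
    return "normal"

print(classify(0x7FF0000000000000))  # infinity
print(classify(0xFFF8100000000000))  # quiet NaN
print(classify(0xFFF7100000000000))  # signaling NaN
print(classify(0x000FFFFFFFFFFFFF))  # denormal
```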

The standard bit assignment for floating point numbers in MMIX is as follows:

The leftmost bit of the 8-byte MMIX data code is the sign of the floating point number (0 - positive, 1 - negative). The next 11 bits hold the exponent, and the last 52 bits form the mantissa.

Please note that the IEEE standard calls this 64-bit representation "double", but for MMIX it's the usual floating point format. MMIX also supports a "short" 32-bit float format, which the IEEE standard calls "single". Don't get confused!
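
The difference between the two formats can be seen with Python's `struct` module, where `">d"` packs a 64-bit value and `">f"` a 32-bit one. This is only a sketch of the IEEE encodings, not of MMIX instructions; the helper names are hypothetical.

```python
import struct

def short_code(x: float) -> str:
    """Return the 32-bit ("short"/"single") code of x as 8 hex digits."""
    return struct.pack(">f", x).hex().upper()

def to_short_and_back(x: float) -> float:
    """Round x to the 32-bit format and convert it back to 64 bits."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(short_code(1.0))                 # 3F800000
print(to_short_and_back(0.5) == 0.5)   # True: 0.5 fits exactly
print(to_short_and_back(0.1) == 0.1)   # False: precision is lost
```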

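Unlike integers, a float's mantissa is always stored as a positive value (the sign is kept in a separate bit). The number's exponent is also stored as a positive value, calculated by the formula:

e = o + 3FF₁₆
where e is the stored exponent and o is the actual order of the number (which can be negative!). The added constant is usually called the bias. Note that o here is counted for a mantissa of the form 1.xxx (1 <= M < 2), the convention IEEE 754 uses; it is one less than the order under the 1/2 <= M < 1 rule. For example, 1.0 has o = 0 and e = 3FF₁₆.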
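
The bias formula and the hidden bit can be combined in a small Python sketch that splits a 64-bit code into its fields and rebuilds the value. It is an illustration under the assumption of the 64-bit layout described above and works for normal numbers only; the names `decode`, `rebuild` and `BIAS` are hypothetical.

```python
import struct

BIAS = 0x3FF

def decode(x: float):
    """Split x into (sign, stored exponent e, order o, 52-bit fraction)."""
    code = int.from_bytes(struct.pack(">d", x), "big")
    sign = code >> 63
    e = (code >> 52) & 0x7FF
    fraction = code & ((1 << 52) - 1)
    return sign, e, e - BIAS, fraction

def rebuild(sign: int, e: int, fraction: int) -> float:
    """Recompute a normal number, restoring the hidden leading 1 bit."""
    mantissa = 1 + fraction / 2**52
    return (-1) ** sign * mantissa * 2.0 ** (e - BIAS)

sign, e, o, fraction = decode(10.0)
print(sign, hex(e), o, hex(fraction))   # 0 0x402 3 0x4000000000000
print(rebuild(sign, e, fraction))       # 10.0
```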

Here are some characteristic examples of MMIX floating point encoding. Each code is written in hexadecimal: three digits for the sign and exponent, then 13 digits for the mantissa.

Value                      Sign+exponent  Mantissa
0.5                        3FE            0000000000000
1.0                        3FF            0000000000000
2.0                        400            0000000000000
4.0                        401            0000000000000
8.0                        402            0000000000000
10.0                       402            4000000000000
100.0                      405            9000000000000
10 000.0                   40C            3880000000000
1 000 000.0                412            E848000000000
0.000 001                  3EB            0C6F7A0B5ED8D
-1.0                       BFF            0000000000000
-10.0                      C02            4000000000000
+0.0                       000            0000000000000
-0.0                       800            0000000000000
maximum normalized (+)     7FE            FFFFFFFFFFFFF
minimum normalized (-)     FFE            FFFFFFFFFFFFF
positive infinity          7FF            0000000000000
negative infinity          FFF            0000000000000
undetermined value         FFF            8000000000000
one of SNaN values         FFF            7100000000000
one of QNaN values         FFF            8100000000000
one of denormal numbers    000            FFFFFFFFFFFFF
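
Codes like these can be reproduced on any machine with 64-bit IEEE 754 floats. The Python sketch below prints a value's code in the same "3 + 13 hex digits" layout; the function name `mmix_code` is hypothetical and the code is an illustration, not part of any MMIX tool.

```python
import struct

def mmix_code(x: float) -> str:
    """Format the 64-bit code of x as 'SEE MMMMMMMMMMMMM' (hex digits)."""
    hexcode = struct.pack(">d", x).hex().upper()
    return hexcode[:3] + " " + hexcode[3:]

print(mmix_code(0.5))            # 3FE 0000000000000
print(mmix_code(10.0))           # 402 4000000000000
print(mmix_code(-10.0))          # C02 4000000000000
print(mmix_code(1000000.0))      # 412 E848000000000
print(mmix_code(float("inf")))   # 7FF 0000000000000
```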

Related topics:

"MMIX basics" page