IEEE 754 summary

IEEE 754 is the standard that defines the floats and doubles we use in most programming languages. I have had to deal with these quite a bit in my work, so I wanted to jot down a quick summary of what it specifies.

I mostly deal with 32-bit floating-point, aka floats, and 64-bit floating-point, aka doubles.

Both are made up of 3 components:

sign (1 bit)
exponent (8 or 11 bits)
significand (23 or 52 bits stored; an implicit leading bit gives 24 or 53 bits of precision)
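To make the layout concrete, here is a minimal sketch that splits a float into its three fields. The names FloatFields, float_bits and decompose are my own, not from any library, and the code assumes float is the 32-bit IEEE 754 format (true on mainstream platforms, but not guaranteed by C++).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;         // 1 bit
    std::uint32_t exponent;     // 8 bits, stored with a bias of 127
    std::uint32_t significand;  // 23 stored bits (leading bit is implicit)
};

// Copy the raw bits out of a float without violating aliasing rules.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Split the raw bits into sign, exponent and significand.
inline FloatFields decompose(float f) {
    const std::uint32_t bits = float_bits(f);
    return { bits >> 31, (bits >> 23) & 0xffu, bits & 0x7fffffu };
}
```

For example, decompose(1.0f) gives sign 0, exponent 127 (the bias, i.e. 2^0) and significand 0. The same idea works for doubles with a 64-bit integer, an 11-bit exponent mask and a 52-bit significand mask.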

There are broadly 3 categories of floating-point:

finite numbers, including +0.0 and -0.0
+infinity, -infinity
NaNs

Infinities have all exponent bits set and a zero significand, so there are only 2 possible bit patterns. For floats:

+infinity == 0x7f80 0000
-infinity == 0xff80 0000

NaNs have all exponent bits set and a non-zero significand, so there are 2^24 - 2 (floats) or 2^53 - 2 (doubles) possible bit patterns: every combination of sign and significand bits, minus the two infinities.

For example, these are all float NaNs:

0x7fc0 0001
0x7f80 0001
0xffc0 0001
0xff80 0001
etc.
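The rules for infinities and NaNs can be checked directly on the bit patterns. This is a rough sketch for floats only; the constant and function names are made up for illustration, and it again assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Masks for the 8 exponent bits and the 23 significand bits of a float.
constexpr std::uint32_t kExponentMask    = 0x7f800000u;
constexpr std::uint32_t kSignificandMask = 0x007fffffu;

// Infinity: all exponent bits set, significand zero (sign may be 0 or 1).
constexpr bool is_infinity_bits(std::uint32_t bits) {
    return (bits & kExponentMask) == kExponentMask &&
           (bits & kSignificandMask) == 0;
}

// NaN: all exponent bits set, significand non-zero (sign may be 0 or 1).
constexpr bool is_nan_bits(std::uint32_t bits) {
    return (bits & kExponentMask) == kExponentMask &&
           (bits & kSignificandMask) != 0;
}

// Raw bits of +infinity as reported by the standard library.
inline std::uint32_t infinity_bits() {
    const float inf = std::numeric_limits<float>::infinity();
    std::uint32_t bits;
    std::memcpy(&bits, &inf, sizeof bits);
    return bits;
}
```

All four NaN patterns listed above satisfy is_nan_bits, and both infinity patterns satisfy is_infinity_bits; no finite number satisfies either.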

On some architectures, NaN results are canonicalized. For example, ARM uses a default NaN of 0x7fc0 0000.

For many operations, the behavior depends on the categories:

-0.0 and +0.0 are not distinguished by comparisons (they compare equal)
+infinity is larger than every finite number and than -infinity
-infinity is smaller than every finite number and than +infinity
NaNs compare unordered with everything, including themselves
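The comparison rules above can be poked at with ordinary C++ operators. A small sketch, assuming the compiler is not run with flags like -ffast-math that relax IEEE semantics; kInf, kNaN and unordered are names I picked for illustration.

```cpp
#include <cmath>
#include <limits>

const double kInf = std::numeric_limits<double>::infinity();
const double kNaN = std::numeric_limits<double>::quiet_NaN();
const double kMax = std::numeric_limits<double>::max();

// True when none of <, > or == holds, i.e. the pair compares unordered.
inline bool unordered(double a, double b) {
    return !(a < b) && !(a > b) && !(a == b);
}

// Comparisons cannot tell the zeroes apart, but the sign bit can.
inline bool zeroes_equal_but_distinct() {
    return (0.0 == -0.0) && !std::signbit(0.0) && std::signbit(-0.0);
}
```

With these, unordered(kNaN, 1.0) and unordered(kNaN, kNaN) are both true, kInf > kMax is true, and kNaN != kNaN is true since != is the negation of ==.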

This list should give a reasonable understanding of how arithmetic and comparison operations behave when given different floating-point operands.

Min/max

This is where it gets a bit tricky. Different architectures implement this slightly differently.

On x86, minps returns its second operand when either input is a NaN, so minps(NaN, 0.0) = 0.0.

On ARM, vmin propagates the NaN instead, so vmin(NaN, 0.0) = NaN.

Similarly for zeroes,

maxps(+0.0, -0.0) = -0.0, but
vmax(+0.0, -0.0) = +0.0.
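To keep the two behaviors straight, I find it helpful to write them as scalar models. These are not the real intrinsics, just my own sketch of one lane of each instruction, with hypothetical function names. The x86 model is a single ordered comparison that falls back to the second operand; the ARM model propagates NaNs and treats -0.0 as smaller than +0.0.

```cpp
#include <cmath>
#include <limits>

// One lane of x86 minps/maxps: if the comparison is false (NaN involved,
// or both operands are zeroes), the second operand is returned.
inline float x86_min_model(float a, float b) { return a < b ? a : b; }
inline float x86_max_model(float a, float b) { return a > b ? a : b; }

// One lane of ARM vmin/vmax: any NaN input gives a NaN result,
// and -0.0 is ordered below +0.0.
inline float arm_min_model(float a, float b) {
    if (std::isnan(a) || std::isnan(b))
        return std::numeric_limits<float>::quiet_NaN();
    if (a == b) return std::signbit(a) ? a : b;  // prefer -0.0 among zeroes
    return a < b ? a : b;
}

inline float arm_max_model(float a, float b) {
    if (std::isnan(a) || std::isnan(b))
        return std::numeric_limits<float>::quiet_NaN();
    if (a == b) return std::signbit(a) ? b : a;  // prefer +0.0 among zeroes
    return a > b ? a : b;
}
```

Note that the x86 model is not commutative: x86_min_model(NaN, 0.0f) is 0.0, but x86_min_model(0.0f, NaN) is NaN. The ARM model gives the same answer in either operand order.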

This difference across platforms is one reason why the WebAssembly SIMD instructions f32x4.min and f64x2.min compile to such different numbers of instructions on each platform.

What does C++ do in this case? Consider the usual implementation of std::min:

template<class T>
const T& min(const T& a, const T& b)
{
        return (b < a) ? b : a;
}

std::min(NaN, 0.0) = NaN: with a = NaN and b = 0.0, the test b < a is false because NaN compares unordered rather than less than, so a is returned.
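One consequence worth spelling out is that the result depends on argument order. The sketch below uses a hypothetical select_min that mirrors the definition quoted above, so the behavior does not depend on any particular standard library. Strictly speaking, NaN does not satisfy the ordering requirements std::min places on its arguments, so treat this as a description of what the usual implementation does in practice.

```cpp
#include <cmath>
#include <limits>

// Same selection rule as the std::min definition quoted above:
// return b only when b is strictly less than a, otherwise return a.
inline double select_min(double a, double b) {
    return (b < a) ? b : a;
}

const double kQuietNaN = std::numeric_limits<double>::quiet_NaN();
```

select_min(kQuietNaN, 0.0) is NaN, while select_min(0.0, kQuietNaN) is 0.0: whenever the comparison is false, the first argument wins. The same rule means select_min(+0.0, -0.0) returns +0.0, because the two zeroes compare equal. Compared with the x86 behavior described earlier, this is the same single comparison with the fallback operand swapped.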

This is the behavior that a recent proposal is introducing to WebAssembly SIMD.

Rounding

I think of rounding in terms of the C functions ceil, floor, trunc, rint and round. IEEE 754 names them differently:

Round towards +infinity == ceil
Round towards -infinity == floor
Round towards zero == trunc
Round to nearest, ties to even == rint or nearbyint (in the default rounding mode)
Round to nearest, ties away from zero == round
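The differences only show up on negative inputs and on ties, so halfway values like 2.5 and -2.5 make good test cases. A small sketch, with Rounded and round_all as names of my own choosing; it assumes the floating-point environment is in its default round-to-nearest mode, since nearbyint follows the current rounding mode.

```cpp
#include <cmath>

// Hypothetical helper type: one input rounded five different ways.
struct Rounded {
    double toward_positive;  // ceil
    double toward_negative;  // floor
    double toward_zero;      // trunc
    double nearest_even;     // nearbyint, in the default rounding mode
    double nearest_away;     // round
};

inline Rounded round_all(double x) {
    return { std::ceil(x), std::floor(x), std::trunc(x),
             std::nearbyint(x), std::round(x) };
}
```

For 2.5 this gives 3, 2, 2, 2, 3, and for -2.5 it gives -2, -3, -2, -2, -3. Ties to even and ties away from zero agree on 3.5 (both give 4) but disagree on 2.5.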
