
IEEE 754 is the standard that defines the floats and doubles we use in most programming languages. I have had to deal with these quite a bit in my work, so I wanted to write down a quick summary of what it specifies.

I mostly deal with 32-bit floating-point, aka floats, and 64-bit floating-point, aka doubles.

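Both are made up of 3 components: a sign bit, an exponent (8 bits for floats, 11 bits for doubles), and a significand (23 bits for floats, 52 bits for doubles).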
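To make the layout concrete, here is a minimal sketch that splits a float into its three fields. The names `FloatFields`, `float_bits` and `decompose` are my own for illustration, and the code assumes `float` is the 32-bit IEEE 754 binary32 format.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;         // 1 bit
    std::uint32_t exponent;     // 8 bits, stored with a bias of 127
    std::uint32_t significand;  // 23 bits (the implicit leading bit is not stored)
};

// Copy the float's bytes into an integer. memcpy avoids the undefined
// behavior of casting between pointer types.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Split the raw bits into sign, exponent and significand.
inline FloatFields decompose(float f) {
    const std::uint32_t bits = float_bits(f);
    return FloatFields{bits >> 31, (bits >> 23) & 0xffu, bits & 0x7fffffu};
}
```

For instance, `decompose(1.0f)` gives sign 0, exponent 127 (the bias, meaning 2^0) and significand 0.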

There are broadly 3 categories of floating-point values: finite numbers, infinities, and NaNs.

Infinities have all bits in the exponent set and a significand of 0, i.e. there are only 2 possible bit patterns for infinities: +infinity and -infinity (0x7f80 0000 and 0xff80 0000 for floats).

NaNs have all bits in the exponent set and a non-zero significand, i.e. there are 2^24 - 2 (floats) or 2^53 - 2 (doubles) possible bit patterns: every combination of sign and significand bits, minus the 2 infinities.
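The two definitions translate directly into bit tests. This is a sketch for 32-bit floats only; the constant and function names are mine, not from any library. The NaN counts are computed as 2 sign values times the number of non-zero significands.

```cpp
#include <cstdint>

constexpr std::uint32_t kExponentMask = 0x7f800000u;     // 8 exponent bits
constexpr std::uint32_t kSignificandMask = 0x007fffffu;  // 23 significand bits

// Exponent all ones, significand zero.
constexpr bool is_infinity_bits(std::uint32_t bits) {
    return (bits & kExponentMask) == kExponentMask && (bits & kSignificandMask) == 0;
}

// Exponent all ones, significand non-zero.
constexpr bool is_nan_bits(std::uint32_t bits) {
    return (bits & kExponentMask) == kExponentMask && (bits & kSignificandMask) != 0;
}

// 2 sign values x (2^23 - 1) or (2^52 - 1) non-zero significands.
constexpr std::uint64_t kFloatNaNCount = 2 * ((std::uint64_t{1} << 23) - 1);
constexpr std::uint64_t kDoubleNaNCount = 2 * ((std::uint64_t{1} << 52) - 1);
static_assert(kFloatNaNCount == (std::uint64_t{1} << 24) - 2, "float NaN count");
static_assert(kDoubleNaNCount == (std::uint64_t{1} << 53) - 2, "double NaN count");
```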

For example:

NaNs:

On some architectures, NaNs are canonicalized: any NaN result is replaced with a single default bit pattern. For example, ARM uses the default NaN of 0x7fc0 0000 for floats.
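As a rough software model of canonicalization (not how any particular hardware implements it), the sketch below maps every NaN to the 0x7fc0 0000 pattern and leaves all other values alone. `to_bits`, `from_bits` and `canonicalize_nan` are hypothetical helper names.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its raw bits and back, via memcpy.
inline std::uint32_t to_bits(float f) {
    std::uint32_t b;
    std::memcpy(&b, &f, sizeof b);
    return b;
}
inline float from_bits(std::uint32_t b) {
    float f;
    std::memcpy(&f, &b, sizeof f);
    return f;
}

// Default NaN: sign 0, exponent all ones, top significand bit set.
constexpr std::uint32_t kDefaultNaNBits = 0x7fc00000u;

// Replace any NaN (whatever its sign or payload) with the default NaN.
inline float canonicalize_nan(float f) {
    return std::isnan(f) ? from_bits(kDefaultNaNBits) : f;
}
```

A NaN with a negative sign and a payload, such as 0xffc1 2345, comes out as 0x7fc0 0000, while infinities and finite values pass through unchanged.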

For many operations, the behavior depends on the categories:

This list should give a reasonable understanding of how arithmetic and comparison operations behave when given different floating-point operands.
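A few of these behaviors are easy to check in code. The sketch below assumes IEEE 754 floats and a compiler that is not running in a fast-math mode; `compares_unordered`, `SpecialResults` and `special_results` are names I made up for the example.

```cpp
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559, "assumes IEEE 754 floats");

constexpr float kInf = std::numeric_limits<float>::infinity();
constexpr float kNaN = std::numeric_limits<float>::quiet_NaN();

// True when a and b are unordered, i.e. at least one of them is a NaN:
// every ordered comparison and == is false, and only != is true.
inline bool compares_unordered(float a, float b) {
    return !(a < b) && !(a > b) && !(a <= b) && !(a >= b) && !(a == b) && (a != b);
}

// Hypothetical helper collecting a few arithmetic results on special values.
struct SpecialResults {
    float inf_plus_one;    // +infinity
    float inf_minus_inf;   // NaN
    float zero_times_inf;  // NaN
    float one_over_inf;    // +0.0
    float nan_plus_one;    // NaN
};

inline SpecialResults special_results() {
    return SpecialResults{kInf + 1.0f, kInf - kInf, 0.0f * kInf, 1.0f / kInf, kNaN + 1.0f};
}
```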

Min/max

This is where it gets a bit tricky. Different architectures implement this slightly differently.

On x86, minps returns its second operand whenever either operand is NaN, so minps(NaN, 0.0) = 0.0.

On ARM, vmin returns NaN if either operand is NaN, so vmin(NaN, 0.0) = NaN.

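Similarly for zeroes: minps returns its second operand when both operands are zeroes (of either sign), whereas vmin treats -0.0 as less than +0.0.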

This difference across platforms is one reason why the code generated for f32x4.min and f64x2.min differs so much in instruction count from one platform to another.
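The two behaviors can be modelled per lane in scalar code. This is only a sketch of the semantics as I understand them, not of the actual instructions: `x86_min` and `arm_min` are hypothetical names, and real ARM hardware may propagate an input NaN rather than return a fresh quiet NaN.

```cpp
#include <cmath>
#include <limits>

// x86 minps, one lane: return a only if a < b, otherwise return b.
// If either operand is NaN, or both are zeroes, the second operand wins.
inline float x86_min(float a, float b) {
    return a < b ? a : b;
}

// ARM vmin, one lane: any NaN operand gives NaN, and -0.0 counts as
// smaller than +0.0.
inline float arm_min(float a, float b) {
    if (std::isnan(a) || std::isnan(b)) {
        return std::numeric_limits<float>::quiet_NaN();  // simplification
    }
    if (a == b) {
        return std::signbit(a) ? a : b;  // picks -0.0 when the signs differ
    }
    return a < b ? a : b;
}
```

With this model, `x86_min(NaN, 0.0f)` is 0.0 but `x86_min(0.0f, NaN)` is NaN, so the x86 result depends on operand order, while `arm_min` gives NaN both ways. For zeroes, `x86_min(0.0f, -0.0f)` is -0.0 and `x86_min(-0.0f, 0.0f)` is +0.0, whereas `arm_min` returns -0.0 in both orders.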

What does C++ do in this case? Consider a typical implementation of std::min:

template<class T>
const T& min(const T& a, const T& b)
{
        return (b < a) ? b : a;
}

std::min(NaN, 0.0) = NaN, since b < a is false: NaN compares unordered, not less than, so the function falls through and returns a.
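Because the result falls back to the first argument whenever `b < a` is false, `std::min` is order-dependent for NaNs and signed zeroes. A small wrapper (the name `std_min` is mine) makes this easy to try out:

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Returns a unless b < a, so unordered or equal operands yield a.
inline float std_min(float a, float b) {
    return std::min(a, b);
}

// std_min(NaN,   0.0f)  -> NaN   (first argument)
// std_min(0.0f,  NaN)   -> 0.0   (first argument)
// std_min(0.0f,  -0.0f) -> +0.0  (first argument, since -0.0 < +0.0 is false)
// std_min(-0.0f, 0.0f)  -> -0.0  (first argument)
```

Note that this is the mirror image of a min defined as `a < b ? a : b`, which returns the second argument in these cases.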

These are the semantics that a recent proposal introduces to WebAssembly SIMD.

Rounding

I think of rounding in terms of the C functions: ceil, floor, trunc, rint, round. They are referred to differently in IEEE 754.
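To compare the five functions side by side, here is a small sketch. `Rounded` and `round_all` are hypothetical names, and the IEEE 754 operation names in the comments are my reading of the usual C-to-IEEE mapping, so treat them as a guide rather than a quotation from the standard. `rint` follows the current rounding mode, assumed here to be the default round-to-nearest, ties-to-even.

```cpp
#include <cmath>

// Hypothetical helper holding the result of each C rounding function.
struct Rounded {
    double up;           // ceil:  roundToIntegralTowardPositive
    double down;         // floor: roundToIntegralTowardNegative
    double toward_zero;  // trunc: roundToIntegralTowardZero
    double current;      // rint:  roundToIntegralExact (current rounding mode)
    double ties_away;    // round: roundToIntegralTiesToAway
};

inline Rounded round_all(double x) {
    return Rounded{std::ceil(x), std::floor(x), std::trunc(x), std::rint(x), std::round(x)};
}
```

The tie cases show the differences best: for 2.5 the results are 3, 2, 2, 2, 3, and for -2.5 they are -2, -3, -2, -2, -3.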