# IEEE 754 summary

2020-05-23 12:30

IEEE 754 is the standard that defines the float and doubles we use in most programming languages. I had to deal with these quite a bit during my work, so I wanted make down a quick summary of what it specifies.

I mostly deal with 32-bit floating-point, aka floats, and 64-bit floating-point, aka doubles.

Both are made up of 3 components:

- sign (1 bit)
- exponent (8 or 11 bits)
- significand (23 or 53 bits)

There are broadly 3 categories of floating-point:

- finite numbers, including +0.0 and -0.0
- +infinity, -infinity
- NaNs

Infinities have all bits in the exponent set and significand 0, i.e. there are only 2 possible bit patterns for infinities:

- +infinity ==
`0x7f80 0000`

- -infinity ==
`0xff80 0000`

NaNs have all bits in the exponent set and significand non-zero, i.e. there are 2^24 - 1 or 2^54 - 1 (sign + significand - infinities) possible bit patterns.

For example:

NaNs:

`0x7fc0 0001`

`0x7f80 0001`

`0xffc0 0001`

`0xff80 0001`

- etc.

For some architectures, the NaNs are canonicalized. For example, ARM uses the default NaN of `0x7fc0 0000`

.

For many operations, the behavior depends on the categories:

- -0.0 and +0.0 are not distinguished by comparisons
- +infinity is larger than every finite number, and -infinity
- -infinity is smaller than every finite number, and +infinity
- NaNs compare unordered with everything

This list should give a reasonable understanding of how arithmetic and comparison operations behave when given different floating-point operands.

## Min/max

This is where it gets a bit tricky. Different architectures implement this slightly differently.

The minps on x86 systems does `minps(NaN,0.0) = 0.0`

.

Whereas on ARM, the vmin does `vmin(NaN, 0.0) = NaN`

Similarly for zeroes,

`maxps(+0.0, -0.0) = -0.0`

, but`vmax(+0.0, -0.0) = +0.0`

.

This difference across platforms is one reason why `f32x4.min`

and `f64x2.min`

has such asymmetric codegen counts.

What does Cpp do in this case? We refer to the implementation of `std::min`

.

```
template<class T>
const T& min(const T& a, const T& b)
{
return (b < a) ? b : a;
}
```

`std::min(NaN, 0.0) = NaN`

, since `b < a == false`

as NaN compares unordered, not less than.

And this is what a recent proposal is introducing to WebAssembly SIMD.

## Rounding

I think of rounding in terms of the C functions: `ceil`

, `floor`

, `trunc`

, `rint`

, `round`

. They are referred to differently in IEEE 754.

- Round towards +infinity ==
`ceil`

- Round towards -infinity ==
`floor`

- Round towards zero ==
`truncate`

- Round to nearest, ties to even ==
`rint`

or`nearbyint`

- Round to nearest, ties away from zero ==
`round`