Memory alignment requirements for SSE and AVX instructions

The short:

SSE instructions require aligned memory operands, except for instructions that say unaligned
AVX instructions do not require aligned memory operands, except for instructions that say aligned

Read on for more.

Memory and alignment

SIMD instructions on x86 systems can take memory operands. However, whenever memory is accessed, we have to consider alignment.

The natural alignment is the size of the memory access - accessing 8 bytes requires the access to be aligned to 8 bytes boundary. Proper alignment ensures that the access will not cross cache lines, leading to performance reduction.

Imagine a 8-byte cache line. Loading 8 bytes on any 8-byte aligned address loads a full cache line. However, if the address is only 4-byte aligned, you will require loading 2 cache lines, effectively halving the speed. This is especially a problem on older processors.

SIMD, SSE, AVX

For SIMD instructions on x86, this usually means 128-bit, 256-bit, or 512-bit alignment. For simplicity we only discuss 128-bit SIMD, available since SSE.

Take a look at an instruction like addps. addps is the SSSE3 version, and vaddps is the AVX (or VEX-encoded version). Both instruction can take a m128 source operand, meaning that 128 bits will be loaded from memory. And as we have discussed, this address should be 128-bit aligned.

The key different in memory alignment requirements between SSE and AVX instructions is that for SSE, you get a segfault; and for AVX, it works, with a potential performance hit (due to cache lines).

This means that when using any SSE instructions (referred to as 128-bit Legacy SSE instruction), all memory operands must be aligned. When using the AVX version (referred to as VEX.128), which also works on 128-bit operands, unaligned memory operands are supported, no segfaults.

There are other instructions that explicitly require alignment, or does not require alignment - these instructions usually have “aligned” or “unaligned” in their names. For example movaps requires aligned memory operands, both SSE and AVX versions, whereas movups does not require any alignment at all.

So, which instruction to use? When in doubt, use movups. It handles all cases, aligned and unaligned. Of course, there is the issue of potential performance drop.