Memory alignment requirements for SSE and AVX instructions
2020-10-29
The short:
- SSE instructions require aligned memory operands, except for instructions that say unaligned
- AVX instructions do not require aligned memory operands, except for instructions that say aligned
Read on for more.
Memory and alignment
SIMD instructions on x86 systems can take memory operands. However, whenever memory is accessed, we have to consider alignment.
The natural alignment is the size of the memory access - accessing 8 bytes requires the access to be aligned to 8 bytes boundary. Proper alignment ensures that the access will not cross cache lines, leading to performance reduction.
Imagine a 8-byte cache line. Loading 8 bytes on any 8-byte aligned address loads a full cache line. However, if the address is only 4-byte aligned, you will require loading 2 cache lines, effectively halving the speed. This is especially a problem on older processors.
SIMD, SSE, AVX
For SIMD instructions on x86, this usually means 128-bit, 256-bit, or 512-bit alignment. For simplicity we only discuss 128-bit SIMD, available since SSE.
Take a look at an instruction like addps. addps
is the SSSE3 version, and vaddps
is the AVX (or VEX-encoded version). Both instruction can take a m128
source operand, meaning that 128 bits will be loaded from memory. And as we have discussed, this address should be 128-bit aligned.
The key different in memory alignment requirements between SSE and AVX instructions is that for SSE, you get a segfault; and for AVX, it works, with a potential performance hit (due to cache lines).
This means that when using any SSE instructions (referred to as 128-bit Legacy SSE instruction), all memory operands must be aligned. When using the AVX version (referred to as VEX.128), which also works on 128-bit operands, unaligned memory operands are supported, no segfaults.
There are other instructions that explicitly require alignment, or does not require alignment - these instructions usually have “aligned” or “unaligned” in their names. For example movaps requires aligned memory operands, both SSE and AVX versions, whereas movups does not require any alignment at all.
So, which instruction to use? When in doubt, use movups. It handles all cases, aligned and unaligned. Of course, there is the issue of potential performance drop.