One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.
Dynamic Data:
To align dynamic data, replace malloc() and free() with _mm_malloc() and _mm_free(). To tell the compiler the data is aligned, use __assume_aligned() before the source loop. Also consider using #include <aligned_new> to enable automatic allocation of aligned data.
Static Data:
To align static data, use __declspec(align()). To tell the compiler the data is aligned, use __assume_aligned() before the source loop.
Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:
float *array;
array = (float *)_mm_malloc(ARRAY_SIZE*sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);
Align static data using a 64-byte boundary:
__declspec(align(64)) float array[ARRAY_SIZE];
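The fragments above can be combined into a compilable sketch. The following is a minimal illustration, not a definitive implementation. It assumes an x86 target where <xmmintrin.h> provides _mm_malloc() and _mm_free(). The macros ASSUME_ALIGNED and ALIGN64 and the helper functions are hypothetical names introduced only for this example. They fall back to the GCC/Clang equivalents (__builtin_assume_aligned and __attribute__((aligned))) when the Intel or Microsoft forms are unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free (assumes an x86 target) */

#define ARRAY_SIZE 1024
#define ALIGNMENT  64

/* Hypothetical wrapper: use the Intel hint when available, otherwise
   the GCC/Clang builtin; expand to nothing on other compilers. */
#if defined(__INTEL_COMPILER)
#define ASSUME_ALIGNED(p, n) __assume_aligned((p), (n))
#elif defined(__GNUC__)
#define ASSUME_ALIGNED(p, n) ((p) = __builtin_assume_aligned((p), (n)))
#else
#define ASSUME_ALIGNED(p, n) ((void)0)
#endif

/* Hypothetical wrapper for static alignment. */
#if defined(_MSC_VER)
#define ALIGN64 __declspec(align(64))
#else
#define ALIGN64 __attribute__((aligned(64)))
#endif

/* Static data aligned on a 64-byte boundary. */
static ALIGN64 float static_array[ARRAY_SIZE];

/* Returns nonzero if p is a multiple of n bytes. */
static int is_aligned(const void *p, size_t n)
{
    return ((uintptr_t)p % n) == 0;
}

/* Allocates aligned dynamic data, hints the compiler, and sums the
   array in a loop. Returns -1.0f if the allocation fails. */
float sum_dynamic_aligned(void)
{
    float *array = (float *)_mm_malloc(ARRAY_SIZE * sizeof(float), ALIGNMENT);
    float sum = 0.0f;
    int i;

    if (array == NULL)
        return -1.0f;

    /* The hint is only valid if the pointer really is aligned. */
    assert(is_aligned(array, ALIGNMENT));
    ASSUME_ALIGNED(array, ALIGNMENT);

    for (i = 0; i < ARRAY_SIZE; i++)
        array[i] = 1.0f;
    for (i = 0; i < ARRAY_SIZE; i++)
        sum += array[i];

    _mm_free(array);  /* memory from _mm_malloc must be released with _mm_free */
    return sum;
}

/* Reports whether the static array landed on a 64-byte boundary. */
int static_array_is_aligned(void)
{
    return is_aligned(static_array, ALIGNMENT);
}
```

Call sum_dynamic_aligned() and static_array_is_aligned() from your own main() to try the sketch. The alignment passed to the allocator and the alignment given in the hint should match: telling the compiler that data is aligned when it is not can cause faults or incorrect results in vectorized code.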