Consider outer loop vectorization

The compiler did not vectorize the loop because the code exceeds the compiler's complexity criteria. You might get higher performance if you enforce vectorization of the loop. To do so, place a directive immediately before the loop block in the source code.
ICL/ICC/ICPC Directive
#pragma omp simd
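
As an illustration of where the directive goes, here is a minimal sketch in C. The function name row_sums, the constant INNER, and the flat row-major layout are hypothetical and chosen only for this example. It assumes an inner loop with a short, fixed trip count, which is a typical case where vectorizing across the outer loop may pay off.

```c
#include <stddef.h>

#define INNER 4  /* hypothetical short inner trip count */

/* Sums each row of an n x INNER row-major matrix into out[i]. */
void row_sums(size_t n, const float *a, float *out)
{
    /* Directive placed right before the outer loop: asks the compiler
       to vectorize across rows (i) rather than within a row (j). */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        float s = 0.0f;
        for (size_t j = 0; j < INNER; ++j)
            s += a[i * INNER + j];
        out[i] = s;
    }
}
```

The pragma only takes effect when OpenMP SIMD support is enabled at compile time, for example with -qopenmp-simd on ICC/ICPC, /Qopenmp-simd on ICL, or -fopenmp-simd on GCC. Without it the pragma is ignored and the loop still produces the same results.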

This issue only flags an opportunity to vectorize the outer loop. To confirm that vectorization is profitable, run the deeper analyses: Memory Access Patterns (MAP), Trip Counts, and Dependencies.
