By default, the compiler targets only innermost loops for vectorization, so it vectorized the inner loop but not the outer loop. However, outer loop vectorization could be more profitable because of a better memory access pattern, higher trip counts, or a better dependency profile.
To enforce outer loop vectorization:

| Target | Directive |
|---|---|
| Outer loop | `#pragma omp simd` |
| Inner loop | `#pragma novector` |

This recommendation only identifies an opportunity to vectorize the outer loop. To confirm that doing so is profitable, perform a deeper analysis: Memory Access Patterns (MAP), Trip Counts, and Dependencies.
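
    /* sum is accumulated across outer iterations, so it is declared as a reduction */
    #pragma omp simd reduction(+:sum)
    for (i = 0; i < N; i++)
    {
        /* keep the inner loop scalar */
        #pragma novector
        for (j = 0; j < N; j++)
        {
            sum += A[i] * A[j];
        }
    }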
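
For experimenting with the pattern outside the original code base, here is a minimal self-contained sketch. The function name `pairwise_product_sum`, the `double` element type, and the `size_t` loop counters are assumptions made for illustration; they are not part of the original example. The `reduction(+:sum)` clause is included because `sum` is accumulated across outer-loop iterations, and `#pragma novector` is guarded because it is an Intel-specific directive that other compilers may not recognize.

```c
#include <stddef.h>

/* Hypothetical helper: returns the sum of A[i] * A[j] over all pairs (i, j). */
double pairwise_product_sum(const double *A, size_t n)
{
    double sum = 0.0;

    /* Request vectorization of the outer loop; the reduction clause tells
       the compiler that sum is accumulated across SIMD lanes. */
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        /* Intel-specific hint that keeps the inner loop scalar.
           Guarded so that other compilers never see an unknown pragma. */
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
        #pragma novector
#endif
        for (size_t j = 0; j < n; j++) {
            sum += A[i] * A[j];
        }
    }
    return sum;
}
```

The `omp simd` directive takes effect only when OpenMP SIMD support is enabled, for example with `-qopenmp-simd` on Intel compilers or `-fopenmp-simd` on GCC and Clang. Without such a flag the pragma is ignored and the function still returns the same result as scalar code. Whether the outer-loop version is actually faster should still be checked with the deeper analysis described above.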