The loop is threaded and auto-vectorized; however, the trip count is not a multiple of the vector length. To fix this, do all of the following:
- Use the #pragma omp parallel for simd directive to parallelize the loop with both threads and SIMD instructions. This directive divides the loop iterations into chunks (subsets) and distributes the chunks among threads; the iterations within each chunk then execute concurrently using SIMD instructions.
- Add the simd modifier to the schedule clause, schedule(simd:[kind]), to guarantee that the chunk size (number of iterations per chunk) is a multiple of the vector length. The last chunk may still be shorter when the trip count is not itself a multiple of the vector length.
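The effect of rounding a chunk size up to a multiple of the vector length can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the OpenMP runtime's code: the helper name round_chunk_to_simd is hypothetical, and the SIMD width is passed in as a parameter because its real value depends on the compiler and target.

```c
/* Hypothetical helper (not part of OpenMP): rounds chunk_size up to the
 * next multiple of simd_width. Assumes chunk_size >= 1 and simd_width >= 1. */
int round_chunk_to_simd(int chunk_size, int simd_width)
{
    /* Integer ceiling division, then scale back up to a whole number of vectors. */
    return ((chunk_size + simd_width - 1) / simd_width) * simd_width;
}
```

With an assumed width of 8 lanes, a requested chunk of 10 iterations becomes 16, while a chunk of 16 is left unchanged.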
Original code:
void f(int a[], int b[], int c[])
{
    /* n (the trip count) is assumed to be declared elsewhere, for example as a global. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}

Revised code:
void f(int a[], int b[], int c[])
{
    /* n (the trip count) is assumed to be declared elsewhere, for example as a global. */
    #pragma omp parallel for simd schedule(simd:static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}
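
To try the revised directive on a trip count that is not a multiple of the vector length, a self-contained variant along the following lines may help. It is a sketch under stated assumptions: add_arrays and check_add_arrays are hypothetical names, and the trip count n is passed as a parameter rather than taken from an enclosing scope as in the code above. The schedule(simd:static) form requires a compiler that supports OpenMP 4.5 or later; without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```c
#include <stdlib.h>

/* Hypothetical variant of f() that receives the trip count explicitly. */
void add_arrays(int *a, const int *b, const int *c, int n)
{
    /* Threads receive chunks; iterations inside each chunk run as SIMD. */
    #pragma omp parallel for simd schedule(simd:static)
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Hypothetical check: fills b and c, runs add_arrays, and returns 1 if
 * every a[i] equals b[i] + c[i], or 0 on mismatch or allocation failure.
 * Assumes n >= 1. */
int check_add_arrays(int n)
{
    int *a = malloc((size_t)n * sizeof *a);
    int *b = malloc((size_t)n * sizeof *b);
    int *c = malloc((size_t)n * sizeof *c);
    int ok = (a && b && c) ? 1 : 0;

    if (ok) {
        for (int i = 0; i < n; i++) {
            b[i] = i;
            c[i] = 2 * i;
        }
        add_arrays(a, b, c, n);
        for (int i = 0; i < n; i++) {
            if (a[i] != 3 * i) {
                ok = 0;
            }
        }
    }

    free(a);
    free(b);
    free(c);
    return ok;
}
```

Calling check_add_arrays with a value such as 1003 exercises the remainder iterations, since 1003 is not a multiple of common vector lengths such as 4, 8, or 16.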