This loop is mostly memory bound

The performance of the loop is bounded by the DRAM bandwidth.
The performance of the loop is bounded by the bandwidth of the shared cache and DRAM.
The performance of the loop is bounded by the private cache bandwidth. The bandwidth of the shared cache and DRAM may degrade performance.
The performance of the loop is bounded by the L1 bandwidth.
The performance of the loop is bounded by the L2 bandwidth.
The performance of the loop is bounded by the L3 bandwidth.
The performance of the loop is bounded by the L4 bandwidth.
The performance of the loop is bounded by the DRAM bandwidth.
The performance of the loop is bounded by the MCDRAM bandwidth.
To improve performance: Improve caching efficiency and eliminate inefficient memory access patterns.
The loop is also scalar. To fix: Vectorize the loop.
The loop is vectorized, but vectorization efficiency is low. Scalar memory instructions might degrade application performance.
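
As an illustration of an inefficient memory access pattern and one common fix, here is a minimal sketch in C. It assumes a row-major matrix stored in a flat array; the function names sum_strided and sum_contiguous are hypothetical and do not come from the tool. The first function walks the matrix column by column, so consecutive inner-loop iterations are cols elements apart. The second applies loop interchange so that the inner loop reads adjacent elements.

```c
#include <assert.h>
#include <stddef.h>

/* Column-first traversal of a row-major matrix. Consecutive inner-loop
   iterations are `cols` elements apart, so for large matrices each access
   tends to land on a different cache line. */
double sum_strided(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            sum += a[i * cols + j];
    return sum;
}

/* Same reduction after loop interchange. The inner loop now has unit
   stride, so every element of a fetched cache line is used before the
   loop moves on to the next line. */
double sum_contiguous(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            sum += a[i * cols + j];
    return sum;
}
```

Both functions add the same elements, only in a different order, so their results can differ by floating-point rounding for general inputs. A unit-stride inner loop is also usually easier for a compiler to vectorize, although whether a floating-point reduction like this one is actually vectorized depends on the compiler and the options used.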
