The bottleneck depends heavily on the fused multiply-add (FMA) computational unit.
The bottleneck depends greatly on the accessed computational unit.
To improve performance: Switch from %c_isa% to %b_isa%, the highest available instruction set architecture (ISA). Eliminate inefficient memory access patterns. %traits% instructions might degrade performance. The loop is scalar. To fix: Vectorize the loop.