Collect FLOP data with callstacks to enhance conclusions for the outer loop
This loop is mostly memory bound
The performance of the loop is bounded by the DRAM bandwidth.
The performance of the loop is bounded by the bandwidth of the shared cache and DRAM.
The performance of the loop is bounded by the private cache bandwidth. The bandwidth of the shared cache and DRAM may degrade performance.
The performance of the loop is bounded by the L1 bandwidth.
The performance of the loop is bounded by the L2 bandwidth.
The performance of the loop is bounded by the L3 bandwidth.
The performance of the loop is bounded by the L4 bandwidth.
The performance of the loop is bounded by the DRAM bandwidth.
The performance of the loop is bounded by the MCDRAM bandwidth.
This loop is mostly compute bound
The bottleneck depends heavily on the FMA computational unit.
The bottleneck depends greatly on the accessed computational unit.
This loop is mostly memory bound but may also be compute bound
The performance of the loop is bounded by the DRAM bandwidth.
The performance of the loop is bounded by the bandwidth of the shared cache and DRAM.
The performance of the loop is bounded by the private cache bandwidth. The bandwidth of the shared cache and DRAM may degrade performance.
The performance of the loop is bounded by the L1 bandwidth.
The performance of the loop is bounded by the L2 bandwidth.
The performance of the loop is bounded by the L3 bandwidth.
The performance of the loop is bounded by the L4 bandwidth.
The performance of the loop is bounded by the DRAM bandwidth.
The performance of the loop is bounded by the MCDRAM bandwidth.
This loop is mostly compute bound but may also be memory bound
The bottleneck depends heavily on the FMA computational unit.
The bottleneck depends greatly on the accessed computational unit.
The performance of the loop is also bounded by the DRAM bandwidth.
The performance of the loop is also bounded by the bandwidth of the shared cache and DRAM.
The performance of the loop is also bounded by the private cache bandwidth.
The performance of the loop is also bounded by the L1 bandwidth.
The performance of the loop is also bounded by the L2 bandwidth.
The performance of the loop is also bounded by the L3 bandwidth.
The performance of the loop is also bounded by the L4 bandwidth.
The performance of the loop is also bounded by the DRAM bandwidth.
The performance of the loop is also bounded by the MCDRAM bandwidth.
You can switch to the Recommendations tab to see optimization recommendations in the Roofline Conclusions section.
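The memory-bound and compute-bound conclusions above follow the general idea of the roofline model, which can be sketched as below. This is a minimal illustration of the textbook model, not the tool's actual heuristics, which these messages do not describe. The function name `classify_loop`, its parameters, and the numbers in the usage note are hypothetical.

```python
def classify_loop(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Classify a loop with the classic roofline model (illustrative only).

    arithmetic_intensity: FLOP per byte moved (assumed to be measured elsewhere)
    peak_gflops:          peak compute rate of the chosen compute roof, in GFLOP/s
    bandwidth_gbs:        bandwidth of the chosen memory roof, in GB/s
    Returns a (conclusion, attainable GFLOP/s) pair.
    """
    # The memory roof grows linearly with arithmetic intensity.
    memory_roof = arithmetic_intensity * bandwidth_gbs
    # Attainable performance is the lower of the two roofs.
    if memory_roof < peak_gflops:
        return "memory bound", memory_roof
    return "compute bound", peak_gflops
```

For example, with an assumed peak of 100 GFLOP/s and an assumed bandwidth of 50 GB/s, a loop at 0.5 FLOP/byte falls under the memory roof, while a loop at 10 FLOP/byte is limited by the compute roof. Repeating the check per memory level (L1, L2, L3, DRAM, and so on) would indicate which bandwidth roof is the limiting one.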