Intel Advisor offers Vectorization Advisor, a vectorization optimization tool, and Threading Advisor, a threading design and prototyping tool, to help your Fortran, C, and C++ applications realize their full performance potential on modern processors, such as Intel® Xeon Phi™ processors.
This ReadMe shows how to use the Intel® Advisor to improve the performance of a C++ sample application.
Follow these initial steps to use the Intel Advisor standalone GUI to try out the vec_samples sample application.
You need the following tools:
Intel Advisor installation package and license
Version 15.x or higher of the Intel C++ compiler or a supported compiler
Use an Intel compiler to get more benefit (and version 17.x to get the most benefit) from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.
.tgz file extraction utility
If you do not already have access to the Intel Advisor or the Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.
To set up the vec_samples sample application:
Copy the vec_samples.tgz file from the <advisor-install-dir>/samples/<locale>/C++/ directory to a writable directory or share on your system.
The default installation path, <advisor-install-dir>, is below:
/opt/intel/ for root users
$HOME/intel/ for non-root users
Extract the sample from the .tgz file.
To follow along with this ReadMe, in a terminal session:
Set up the C++ compiler environment. For example, in a terminal session, type one of the following:
source /<compiler-install-dir>/bin/compilervars.csh ia32 or source /<compiler-install-dir>/bin/compilervars.sh ia32
source /<compiler-install-dir>/bin/compilervars.csh intel64 or source /<compiler-install-dir>/bin/compilervars.sh intel64
See Specifying Location of Compiler Components with compilervars File in the Intel® C++ Compiler <version> User and Reference Guide for more information.
Change directory to the vec_samples/ directory in its unzipped location.
Build the target sample application in release mode using the make baseline command, which contains these compiler options: -O2 -g
Do one of the following to set up the Intel Advisor environment.
Run the source <advisor-install-dir>/advixe-vars.csh or source <advisor-install-dir>/advixe-vars.sh command
Add <advisor-install-dir>/bin32 or <advisor-install-dir>/bin64 to your path.
Run the <parallel-studio-install-dir>/psxevars.csh or <parallel-studio-install-dir>/psxevars.sh command
The default installation path for both <advisor-install-dir> and <parallel-studio-install-dir> is below:
/opt/intel/ for root users
$HOME/intel/ for non-root users
Type advixe-gui & to run the Intel Advisor standalone GUI in the background.
Type vec_samples in the Project name field, supply a location for the sample application project, then click the Create Project button to open the Project Properties dialog box.
In the Analysis Target tab, ensure the Survey Hotspots Analysis type is selected.
Click the Browse... button next to the Application field and choose the just-built vec_samples binary file.
For the Survey Trip Count Analysis, Dependencies Analysis, and Memory Access Patterns Analysis types, make sure the Inherit settings from Survey Hotspots Analysis Type checkbox is selected.
Click the OK button to close the Project Properties dialog box.
If an infotip displays, select the Do not show this window again checkbox and close the infotip.
To set a performance baseline for the improvements that will follow, do the following.
In the Vectorization Workflow pane, click the collect control under Survey Target to produce a Survey Report.
Intel Advisor stores only the most recent analysis result. To save a result snapshot you can view any time, do the following:
Click the snapshot icon.
Type snapshot_baseline in the Result name field.
Select the Pack into archive checkbox to enable the Result path field.
Browse to a desired location, then click the OK button to save a read-only snapshot of the current result.
If the Survey Report remains grayed out after the snapshot process is complete, click anywhere on the report.
Note the Elapsed time value in the top left corner. This is the baseline against which subsequent improvements will be measured.
In the Type column, all detected loops are scalar.
In the Why No Vectorization? column, the compiler detected or assumed a vector dependence in most loops.
For one of the loops where the compiler detected or assumed a vector dependence, click the control to display how-can-I-fix-this-issue? information in the Compiler Diagnostic Details pane.
Click Summary on the navigation toolbar to open the Summary window. Think of this window as a dashboard to which the Intel Advisor adds data each time you run Intel Advisor tools.
Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know pointers do not alias, and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.
In Multiply.c, the compiler generates runtime checks to determine if pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of argument b informs the compiler that the pointer does not alias any other pointer and that array b does not overlap with a or x.
To see if the NOALIAS macro improves performance, do the following.
In the terminal session, type make clean, then rebuild the target using the make noalias command, which contains these compiler options:
-O2 -g
-D NOALIAS
In the Vectorization Workflow pane, click the collect control under Survey Target.
In the Vectorization Workflow pane, click the collect control under Find Trip Counts.
If the FLOPS profiler disabled warning displays, click the Continue button.
Click the snapshot icon and save a snapshot_noalias result.
In the new Survey Report, notice:
The value in the Vector Instruction Set column in the top pane is probably SSE2, the default Vector Instruction Set Architecture (ISA). Also, there is probably a warning at the top of the Survey Report: Higher instruction set architecture (ISA) available; consider recompiling your application using a higher ISA. We will do this later.
The compiler successfully vectorizes two loops: in matvec at Multiply.c:69 and in matvec at Multiply.c:60.
The Elapsed time improves substantially.
Click the expand icon next to each of the two vectorized loops. Notice both loops have a remainder loop present. Click the expand icon in the Trip Counts column set to expand it. The remainder loops are present because the loop trip counts are not a multiple of the VL (Vector Length) value.
Check the changes in the new Summary.
Generating code for different instruction sets available on your compilation host processor may improve performance.
The -xHost option tells the compiler to generate instructions for the highest instruction set available on the compilation host processor.
To see if the -xHost option improves performance, do the following.
In the terminal session, type make clean, then rebuild the target using the make xhost command, which contains these compiler options:
-g
-D NOALIAS
-xHost
For safety purposes, the compiler is often conservative when assuming data dependencies. Use a Dependencies-focused Refinement Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the analysis can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.
Notice the Vector Issue for the loop in matvec at Multiply.c:82: Assumed dependency present. Click the icon in the Vector Issues column to display the associated recommendation in the bottom pane: Confirm dependency is real. There may also be a recommendation to Remove dependency with a reduction, including an example.
To identify, explore, and fix real loop-carried dependencies, do the following.
In the checkbox column in the Survey Report, select the checkbox for the loop in matvec at Multiply.c:82.
In the Vectorization Workflow pane, click the collect control under Check Dependencies to produce a Dependencies Report.
If analysis time during this exercise is a consideration: After the Dependencies Report tab displays a RAW (Read after write) or a WAW (Write after write) dependency, click the stop control under Check Dependencies to stop the current analysis and display the result collected thus far.
Click the snapshot icon and save a snapshot_dependencies result.
In the top pane of the Refinement Reports window, notice the Intel Advisor reports a RAW and a WAW dependency in the loop in matvec at Multiply.c:82. The Dependencies Report tab in the bottom pane shows the source of the dependency: Addition in the sumx variable.
The REDUCTION define applies an omp simd directive with a reduction clause, so each SIMD lane computes its own sum, and the results are combined at the end. (Applying an omp simd directive without a reduction clause will generate incorrect code.)
To see if the REDUCTION define vectorizes the loop: In the terminal session, type make clean, then rebuild the target using the make reduction command, which contains these compiler options:
-g -qopenmp
-D NOALIAS
-xHost
-D REDUCTION
Use a MAP-focused Refinement Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.
To identify and explore data structure layout in memory, do the following.
In the checkbox column in the Survey Report, select the checkboxes for the loops in matvec at Multiply.c:69 and matvec at Multiply.c:60.
Notice the loop in matvec at Multiply.c:60 has a better Efficiency and Gain Estimate than the loop in matvec at Multiply.c:69.
In the Vectorization Workflow pane, click the collect control under Check Memory Access Patterns to produce a Memory Access Patterns Report.
Click the snapshot icon and save a snapshot_map result.
In the top pane of the Refinement Reports window, hover your mouse over the Strides Distribution values to display explanatory tooltips for the loops matvec at Multiply.c:69 and matvec at Multiply.c:60. In the loop in matvec at Multiply.c:60, the percentage of memory instructions with stride 1 (unit stride) or stride 0 (uniform) accesses is 100%, which contributes to the better Efficiency and Gain Estimate.
Return to the Survey Report. Notice the content in the Survey Report has also changed: In the Vector Issues column, the Intel Advisor reports Inefficient memory patterns present for the loop in matvec at Multiply.c:69. Click the icon in the Vector Issues column to display the associated recommendations in the bottom pane.
Check the changes in the new Summary.
We will not fix the cause of this issue (the code indexes the rows of array a by the inner-loop variable l) in this ReadMe. (In C/C++ applications, vary the column index in the innermost loop; in this case, you could modify the code to access array a as a[i][l].)
Sometimes data alignment can improve vectorization, and sometimes it is prerequisite to complete and effective vectorization. When data in a loop with potential for vectorization is not aligned, the compiler may generate a:
Peeled loop to align memory accesses inside the loop body and maximize loop efficiency
Remainder loop to clean up any remaining iterations that do not fit within the scope of the loop body
Compiling with the ALIGNED macro:
Aligns arrays a, b, and x in Driver.c on a 16- or 32-byte boundary, depending on the instruction set architecture.
Pads the row length of the matrix a to a multiple of 16 or 32 bytes, so each individual row of a is 16- or 32-byte aligned.
Tells the compiler it can safely assume the arrays in Multiply.c are aligned.
To see if the ALIGNED macro improves performance, do the following.
In the terminal session, type make clean, then rebuild the target using the make align command, which contains these compiler options:
-g -qopenmp
-D NOALIAS
-xHost
-D REDUCTION
-D ALIGNED
The compiler determines vectorization is unsafe when it cannot tell if a loop contains unique arrays. If you inline such a loop, the compiler can tell exactly which variables you want processed in the loop, and can therefore determine vectorization is safe.
When you use the matvec function in the sample application, the compiler cannot tell if a and b are unique arrays. The NOFUNCCALL macro removes the matvec function and inlines the loop instead.
To see if the NOFUNCCALL macro improves performance, do the following.
In the terminal session, type make clean, then rebuild the target using the make nofunc command, which contains these compiler options:
-g -qopenmp
-D NOALIAS
-xHost
-D REDUCTION
-D ALIGNED
-D NOFUNCCALL
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors which may cause deviations from published specifications. Current characterized errata are available on request.
Cilk, Intel, the Intel logo, Intel Atom, Intel Core, Intel Inside, Intel NetBurst, Intel SpeedStep, Intel vPro, Intel Xeon Phi, Intel XScale, Itanium, MMX, Pentium, Thunderbolt, Ultrabook, VTune and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation