Vectorization Sample for the Intel® Advisor - Linux* OS

Intel Advisor offers Vectorization Advisor, a vectorization optimization tool, and Threading Advisor, a threading design and prototyping tool, to help ensure your Fortran, C and C++ applications realize full performance potential on modern processors, such as Intel® Xeon Phi™ processors.

This ReadMe shows how to use the Intel® Advisor to improve the performance of a C++ sample application. Follow these steps:

  1. Prepare the sample application.

  2. Establish a performance baseline.

  3. Disambiguate pointers.

  4. Generate instructions for the highest instruction set available.

  5. Handle dependencies.

  6. Analyze memory access patterns.

  7. Align data.

  8. Reorganize code.

Standalone GUI: Prepare the Sample Application

Follow these initial steps to use the Intel Advisor standalone GUI to try out the vec_samples sample application.

  • Get software tools and unpack the sample application.

  • Prepare the sample application.

  • Launch the Intel Advisor.

  • Prepare the project.

Get Software Tools and Unpack the Sample Application

You need the following tools:

  • Intel Advisor installation package and license

  • Version 15.x or higher of the Intel C++ compiler or a supported compiler

    Use an Intel compiler to get more benefit (and version 17.x to get the most benefit) from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.

  • .tgz file extraction utility

If you do not already have access to the Intel Advisor or the Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.

To set up the vec_samples sample application:

  1. Copy the vec_samples.tgz file from the <advisor-install-dir>/samples/<locale>/C++/ directory to a writable directory or share on your system.

    The default installation path, <advisor-install-dir>, is below:

    • /opt/intel/ for root users

    • $HOME/intel/ for non-root users

  2. Extract the sample from the .tgz file.

Prepare the Sample Application

To follow along with this ReadMe, in a terminal session:

  1. Set up the C++ compiler environment. For example, type one of the following:

    • source /<compiler-install-dir>/bin/compilervars.csh ia32 or source /<compiler-install-dir>/bin/compilervars.sh ia32

    • source /<compiler-install-dir>/bin/compilervars.csh intel64 or source /<compiler-install-dir>/bin/compilervars.sh intel64

      See Specifying Location of Compiler Components with compilervars File in the Intel® C++ Compiler <version> User and Reference Guide for more information.

  2. Change directory to the vec_samples/ directory in its unzipped location.

  3. Build the target sample application in release mode using the make baseline command, which contains these compiler options: -O2 -g

Tip

  • For your convenience, there are also Instruction Set Architecture-related Release choices.

  • Keep the terminal session open.

Launch the Intel Advisor

In the terminal session:

  1. Do one of the following to set up the Intel Advisor environment.

    • Run the source <advisor-install-dir>/advixe-vars.csh or source <advisor-install-dir>/advixe-vars.sh command

    • Add <advisor-install-dir>/bin32 or <advisor-install-dir>/bin64 to your path.

    • Run the <parallel-studio-install-dir>/psxevars.csh or <parallel-studio-install-dir>/psxevars.sh command

    The default installation path for both <advisor-install-dir> and <parallel-studio-install-dir> is below:

    • /opt/intel/ for root users

    • $HOME/intel/ for non-root users

  2. Type advixe-gui & to run the Intel Advisor standalone GUI in the background.

Prepare the Project

  1. Choose File > New > Project... (or click New Project... in the Welcome page) to display the Create a Project dialog box.

  2. Type vec_samples in the Project name field, supply a location for the sample application project, then click the Create Project button to open the Project Properties dialog box.

  3. In the Analysis Target tab, ensure the Survey Hotspots Analysis type is selected.

  4. Click the Browse... button next to the Application field and choose the just-built vec_samples binary file.

  5. For the Survey Trip Count Analysis, Dependencies Analysis, and Memory Access Patterns Analysis types, make sure the Inherit settings from Survey Hotspots Analysis Type checkbox is selected.

  6. Click the OK button to close the Project Properties dialog box.

  7. If an infotip displays, select the Do not show this window again checkbox and close the infotip.

Establish Performance Baseline

To set a performance baseline for the improvements that will follow, do the following.

Run a Survey Analysis

In the Vectorization Workflow pane, click the Run analysis control under Survey Target to produce a Survey Report.

Save a Result Snapshot

Intel Advisor stores only the most recent analysis result. To save a result snapshot you can view at any time, do the following:

  1. Click the Snapshot icon.

  2. Type snapshot_baseline in the Result name field.

  3. Select the Pack into archive checkbox to enable the Result path field.

  4. Browse to a desired location, then click the OK button to save a read-only snapshot of the current result.

  5. If the Survey Report remains grayed out after the snapshot process is complete, click anywhere on the report.

Assess Performance

  1. In the Survey Report, notice:

    • The Elapsed time value in the top left corner. This is the baseline against which subsequent improvements will be measured.

    • In the Type column, all detected loops are scalar.

    • In the Why No Vectorization? column, the compiler detected or assumed a vector dependence in most loops.

  2. For one of the loops where the compiler detected or assumed a vector dependence, click the Compiler diagnostic details control to display how-can-I-fix-this-issue? information in the Compiler Diagnostic Details pane.

  3. Click Summary on the navigation toolbar to open the Summary window. Think of this window as a dashboard to which the Intel Advisor adds data each time you run Intel Advisor tools.

Disambiguate Pointers

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know pointers do not alias, and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine if pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of argument b informs the compiler that the pointer does not alias any other pointer, so array b does not overlap with a or x.
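
The pattern described above can be sketched as follows. This is a minimal illustration, not the actual contents of Multiply.c: the B_QUAL macro, the matvec_sketch name, the rows parameter, and the values chosen for COLWIDTH and FTYPE are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the sample's definitions; the real
   Multiply.c may define these differently. */
#define COLWIDTH 4
typedef double FTYPE;

#ifdef NOALIAS
#define B_QUAL restrict   /* promise: b overlaps neither a nor x */
#else
#define B_QUAL            /* no promise: the compiler must allow for aliasing */
#endif

/* Compute b = a * x for a matrix with `rows` rows and COLWIDTH columns. */
void matvec_sketch(size_t rows, FTYPE a[][COLWIDTH],
                   FTYPE *B_QUAL b, FTYPE x[])
{
    assert(a != NULL && b != NULL && x != NULL);
    for (size_t i = 0; i < rows; i++) {
        b[i] = 0.0;
        for (size_t j = 0; j < COLWIDTH; j++) {
            /* Without restrict, each store to b[i] might change a or x,
               so the compiler has to guard the vectorized path. */
            b[i] += a[i][j] * x[j];
        }
    }
}
```

Building with -D NOALIAS turns B_QUAL into restrict; building without it leaves the parameter unqualified, so the same source can be compared both ways. The result is identical in either build as long as the caller really does pass non-overlapping arrays.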

To see if the NOALIAS macro improves performance, do the following.

Rebuild the Target With the NOALIAS Macro

In the terminal session, type make clean, then rebuild the target using the make noalias command, which contains these compiler options:

  • -O2 -g

  • -D NOALIAS

Re-run the Survey Analysis and Run a Trip Counts Analysis

  1. In the Vectorization Workflow pane, click the Run analysis control under Survey Target.

  2. In the Vectorization Workflow pane, click the Run analysis control under Find Trip Counts.

  3. If the FLOPS profiler disabled warning displays, click the Continue button.

  4. Click the Snapshot icon and save a snapshot_noalias result.

Assess Impact on Performance

  1. In the new Survey Report, notice:

    • The value in the Vector Instruction Set column in the top pane is probably SSE2, the default Vector Instruction Set Architecture (ISA). Also, there is probably a warning at the top of the Survey Report: Higher instruction set architecture (ISA) available; consider recompiling your application using a higher ISA. We will do this later.

    • The compiler successfully vectorizes two loops: in matvec at Multiply.c:69 and in matvec at Multiply.c:60.

    • The Elapsed time improves substantially.

  2. Click the Expand data row icon next to the two vectorized loops. Notice both loops have a remainder loop present. Click the Expand column set icon in the Trip Counts column set to expand it. The remainder loops are present because the trip counts of the original loops are not a multiple of the VL (Vector Length) value.

  3. Check the changes in the new Summary.
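
To make the remainder loops from step 2 concrete, the following sketch writes out by hand the structure a vectorizing compiler produces: a main body that consumes VL elements per pass, followed by a scalar loop for the leftover iterations. The VL value of 4 and both function names are hypothetical; the real vector length depends on the instruction set and data type.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* hypothetical vector length, in elements */

/* Number of iterations left over for the remainder loop. */
size_t remainder_iterations(size_t trip_count)
{
    return trip_count % VL;
}

/* Sum n floats using the shape of a vectorized loop: a main body that
   handles VL elements per pass, then a scalar remainder loop. */
float strip_mined_sum(const float *data, size_t n)
{
    float lanes[VL] = {0};
    size_t i = 0;

    assert(data != NULL || n == 0);

    /* Main body: runs n / VL times; each pass stands in for one vector add. */
    for (; i + VL <= n; i += VL) {
        for (size_t lane = 0; lane < VL; lane++) {
            lanes[lane] += data[i + lane];
        }
    }

    /* Combine the per-lane partial sums. */
    float total = 0.0f;
    for (size_t lane = 0; lane < VL; lane++) {
        total += lanes[lane];
    }

    /* Remainder loop: the final n % VL elements, one at a time. */
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

With a trip count of 10 and a VL of 4, the main body runs twice and the remainder loop runs twice. A trip count of 8 would leave nothing for the remainder loop.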

Generate Instructions for Highest Instruction Set Available

Generating code for different instruction sets available on your compilation host processor may improve performance.

The xHost option tells the compiler to generate instructions for the highest instruction set available on the compilation host processor.
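
One way to confirm which instruction set a build actually targeted is to inspect the feature macros the compiler predefines. The sketch below is not part of the sample. It assumes the __SSE2__, __AVX__, __AVX2__, and __AVX512F__ macros that GCC, Clang, and the Intel compilers define on x86 targets, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Name of the widest x86 vector ISA the compiler was allowed to target,
   judged from predefined feature macros. */
const char *compiled_vector_isa(void)
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "unknown";
#endif
}

/* Vector register width in bytes for that ISA (0 if unknown). */
int compiled_vector_bytes(void)
{
#if defined(__AVX512F__)
    return 64;
#elif defined(__AVX__)
    return 32;
#elif defined(__SSE2__)
    return 16;
#else
    return 0;
#endif
}

/* Upper bound on elements per vector register for a given element size. */
int max_lanes(size_t elem_size)
{
    assert(elem_size > 0);
    return (int)((size_t)compiled_vector_bytes() / elem_size);
}
```

On a default x86-64 build this typically reports SSE2 with 16-byte registers; a build for a newer instruction set reports a wider one. The compiler may still choose a shorter vector length for a particular loop, so treat max_lanes as an upper bound rather than a prediction of the VL column.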

To see if the xHost option improves performance, do the following.

Rebuild the Target With the xHost Option

In the terminal session, type make clean, then rebuild the target using the make xhost command, which contains these compiler options:

  • -g

  • -D NOALIAS

  • -xHost

Re-run the Survey Analysis

  1. In the Vectorization Workflow pane, click the Run analysis control under Survey Target.

  2. If the Target binaries changed since the last Trip Counts analysis run warning displays, click the Delete button.

  3. Click the Snapshot icon and save a snapshot_xhost result.

Assess Impact on Performance

  1. In the new Survey Report, notice:

    • The Elapsed time improves.

    • The values in the Vector ISA and VL columns in the top pane (probably) change.

  2. Check the changes in the new Summary.

Handle Dependencies

For safety purposes, the compiler is often conservative when assuming data dependencies. Use a Dependencies-focused Refinement Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the analysis can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.
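
To illustrate why a real dependency makes forced vectorization unsafe, the sketch below runs a running-sum loop two ways: in scalar order, and in the order a vector unit would use (load a block of VL inputs, then store VL results). The loop, the VL value of 4, and the function names are hypothetical and do not come from the sample.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* hypothetical vector length, in elements */

/* Scalar order: iteration i reads the value iteration i-1 just wrote
   (a read-after-write dependency carried across iterations). */
void prefix_scalar(float *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}

/* Emulates forcing vectorization of the same loop: all VL loads of a
   block happen before any of its stores, so later lanes read stale data. */
void prefix_forced_vector(float *a, size_t n)
{
    size_t i = 1;

    assert(a != NULL || n == 0);
    for (; i + VL <= n; i += VL) {
        float prev[VL];
        for (size_t lane = 0; lane < VL; lane++) {
            prev[lane] = a[i + lane - 1];   /* "vector load" */
        }
        for (size_t lane = 0; lane < VL; lane++) {
            a[i + lane] += prev[lane];      /* "vector add and store" */
        }
    }
    for (; i < n; i++) {                    /* scalar remainder */
        a[i] += a[i - 1];
    }
}
```

Starting from five ones, the scalar version produces 1, 2, 3, 4, 5, while the emulated vector version produces 1, 2, 2, 2, 2. A Dependencies analysis helps distinguish this kind of real dependency from one the compiler merely assumed.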

Notice the Vector Issue for the loop in matvec at Multiply.c:82: Assumed dependency present. Click the Recommendations icon in the Vector Issues column to display the associated recommendation in the bottom pane: Confirm dependency is real. There may also be a recommendation to Remove dependency with a reduction example.

To identify, explore, and fix real loop-carried dependencies, do the following.

Run a Dependencies Analysis

  1. In the Survey Report, select the checkbox for the loop in matvec at Multiply.c:82.

  2. In the Vectorization Workflow pane, click the Run analysis control under Check Dependencies to produce a Dependencies Report.

    If analysis time during this exercise is a consideration: After the Dependencies Report tab displays a RAW (Read after write) or a WAW (Write after write) dependency, click the Stop analysis control under Check Dependencies to stop the current analysis and display the result collected thus far.

  3. Click the Snapshot icon and save a snapshot_dependencies result.

Assess Dependencies

In the top pane of the Refinement Reports window, notice the Intel Advisor reports a RAW and a WAW dependency in the loop in matvec at Multiply.c:82. The Dependencies Report tab in the bottom pane shows the source of the dependency: Addition in the sumx variable.

The REDUCTION define applies an omp simd directive with a reduction clause, so each SIMD lane computes its own sum, and the results are combined at the end. (Applying an omp simd directive without a reduction clause will generate incorrect code.)
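
A minimal sketch of that pattern follows. It assumes the loop is a dot product accumulating into sumx, as the Dependencies Report suggests; the row_dot function name and its parameters are hypothetical, and the actual loop in Multiply.c may look different.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of one matrix row with x. Every iteration reads and
   writes sumx, which is the RAW/WAW pattern the analysis reports. */
double row_dot(const double *row, const double *x, size_t n)
{
    double sumx = 0.0;

    assert(row != NULL && x != NULL);

#ifdef REDUCTION
    /* Each SIMD lane keeps a private partial sum; the partial sums are
       added together after the loop. Needs an OpenMP-aware build. */
    #pragma omp simd reduction(+:sumx)
#endif
    for (size_t j = 0; j < n; j++) {
        sumx += row[j] * x[j];
    }
    return sumx;
}
```

Because the reduction adds the terms in a different order than the scalar loop, floating-point results can differ in the last bits. That is usually acceptable, but it is worth knowing when comparing output between builds.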

Rebuild the Target With the REDUCTION Define

To see if the REDUCTION define vectorizes the loop: In the terminal session, type make clean, then rebuild the target using the make reduction command, which contains these compiler options:

  • -g -qopenmp

  • -D NOALIAS

  • -xHost

  • -D REDUCTION

Re-run the Survey Analysis

  1. In the Vectorization Workflow pane, click the Run analysis control under Survey Target.

  2. Click the Snapshot icon and save a snapshot_reduction result.

Assess Impact on Dependencies

  1. In the new Survey Report, notice the assumed dependency is gone and the loop in matvec at Multiply.c:82 is now vectorized.

  2. Check the changes in the new Summary.

Analyze Memory Access Patterns

Use a MAP-focused Refinement Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.

To identify and explore data structure layout in memory, do the following.

Run a Memory Access Patterns Analysis

  1. In the Survey Report, select the checkboxes for the loops in:

    • matvec at Multiply.c:69

    • matvec at Multiply.c:60

  2. Notice the loop in matvec at Multiply.c:60 has a better Efficiency and Gain Estimate than the loop in matvec at Multiply.c:69.

  3. In the Vectorization Workflow pane, click the Run analysis control under Check Memory Access Patterns to produce a Memory Access Patterns Report.

  4. Click the Snapshot icon and save a snapshot_map result.

Assess Memory Issues

  1. In the top pane of the Refinement Reports window, hover your mouse over the Strides Distribution values to display explanatory tooltips for the loops matvec at Multiply.c:69 and matvec at Multiply.c:60. In the loop in matvec at Multiply.c:60, the percentage of memory instructions with stride 1 (unit stride) or stride 0 (uniform) accesses is 100%, which contributes to the better Efficiency and Gain Estimate.

  2. Return to the Survey Report. Notice the content in the Survey Report has also changed: In the Vector Issues column, the Intel Advisor reports Inefficient memory patterns present for the loop in matvec at Multiply.c:69. Click the Recommendations icon in the Vector Issues column to display the associated recommendations in the bottom pane.

  3. Check the changes in the new Summary.

We will not fix the cause of this issue (the code is indexing the row of array a by l) in this ReadMe. (In C/C++ applications, access the column index by the innermost loop; in this case, you could modify the code to access array a as follows: a[i][l].)
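
The difference between the two indexing orders comes from the row-major layout of C arrays, which the following sketch measures directly. The array dimensions and function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 3
#define COLS 8   /* hypothetical dimensions */

/* Distance between two elements of the same array, in elements. */
ptrdiff_t element_stride(const double *first, const double *second)
{
    assert(first != NULL && second != NULL);
    return (ptrdiff_t)(((const char *)second - (const char *)first)
                       / (ptrdiff_t)sizeof(double));
}

/* Stride between consecutive inner-loop accesses when the inner index
   selects the column, as in a[i][l] with l innermost. */
ptrdiff_t stride_inner_is_column(double a[ROWS][COLS])
{
    return element_stride(&a[0][0], &a[0][1]);
}

/* Stride when the inner index selects the row, as in a[l][i] with
   l innermost. */
ptrdiff_t stride_inner_is_row(double a[ROWS][COLS])
{
    return element_stride(&a[0][0], &a[1][0]);
}
```

Indexing the column with the innermost loop gives a stride of 1, which a vector load can satisfy from consecutive memory. Indexing the row gives a stride of COLS elements, which is the kind of non-unit stride the Memory Access Patterns Report flags.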

Align Data

Sometimes data alignment can improve vectorization, and sometimes it is a prerequisite for complete and effective vectorization. When data in a loop with potential for vectorization is not aligned, the compiler may generate a:

  • Peeled loop to align memory accesses inside the loop body and maximize loop efficiency

  • Remainder loop to clean up any remaining iterations that do not fit within the scope of the loop body

Bottom line: Aligned loads are faster than unaligned loads; however, the speed difference depends on the processor.

The ALIGNED macro:

  • Aligns arrays a, b, and x in Driver.c on a 16- or 32-byte boundary depending on the instruction set architecture.

  • Pads the row length of the matrix, a, to be a multiple of 16 or 32 bytes, so each individual row of a is 16- or 32-byte aligned.

  • Tells the compiler it can safely assume the arrays in Multiply.c are aligned.
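
The first two points can be sketched in standard C11 as shown below. This is an illustration rather than the code in Driver.c: the ALIGN_BYTES value, the function names, the matrix variable, and the row length of 101 elements are all assumptions. The third point relies on compiler-specific directives and is not shown.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 32   /* hypothetical: 32 for AVX-class ISAs, 16 for SSE */

/* Round a row length (in elements) up so that each row occupies a
   whole number of ALIGN_BYTES blocks. */
size_t padded_row_length(size_t cols, size_t elem_size)
{
    assert(elem_size > 0 && ALIGN_BYTES % elem_size == 0);
    size_t per_block = ALIGN_BYTES / elem_size;
    return (cols + per_block - 1) / per_block * per_block;
}

/* Return nonzero when p sits on an ALIGN_BYTES boundary. */
int is_aligned(const void *p)
{
    return ((uintptr_t)p % ALIGN_BYTES) == 0;
}

/* An aligned matrix with padded rows: 101 doubles per row rounded up
   to 104, so every row is 832 bytes, a multiple of 32. */
alignas(ALIGN_BYTES) double matrix[4][104];
```

Aligning only the start of the matrix is not enough. Without the padding, a row of 101 doubles is 808 bytes long, so the second row would begin 8 bytes past a 32-byte boundary.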

To see if the ALIGNED macro improves performance, do the following.

Rebuild the Target With the ALIGNED Macro

In the terminal session, type make clean, then rebuild the target using the make align command, which contains these compiler options:

  • -g -qopenmp

  • -D NOALIAS

  • -xHost

  • -D REDUCTION

  • -D ALIGNED

Re-run the Survey Analysis

  1. In the Vectorization Workflow pane, click the Run analysis control under Survey Target.

  2. Click the Snapshot icon and save a snapshot_align result.

Assess Impact on Performance

  1. In the new Survey Report, notice the peeled loop disappears and the Elapsed time improves.

  2. Check the changes in the new Summary.

Reorganize Code

The compiler determines vectorization is unsafe when it cannot tell if a loop contains unique arrays. If you inline such a loop, the compiler can tell exactly which variables you want processed in the loop, and can therefore determine vectorization is safe.

When you use the matvec function in the sample application, the compiler cannot tell if a and b are unique arrays. The NOFUNCCALL macro removes the matvec function and inlines the loop instead.
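
The two code shapes can be sketched as follows. The array size, the matvec_call and run names, and the test data are hypothetical; the sample's driver is organized differently, but the contrast between calling a function and writing the loop in place is the same.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* hypothetical array size */

/* Inside this function a, b, and x are only pointers, so the compiler
   cannot prove on its own that they refer to distinct arrays. */
void matvec_call(double a[][N], double b[], double x[])
{
    for (size_t i = 0; i < N; i++) {
        b[i] = 0.0;
        for (size_t j = 0; j < N; j++) {
            b[i] += a[i][j] * x[j];
        }
    }
}

/* Driver: with NOFUNCCALL defined, the loop nest is written in place,
   where a, b, and x are visibly three separate local arrays. */
double run(void)
{
    double a[N][N], b[N], x[N];

    for (size_t i = 0; i < N; i++) {
        x[i] = 1.0;
        for (size_t j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
        }
    }

#ifdef NOFUNCCALL
    for (size_t i = 0; i < N; i++) {
        b[i] = 0.0;
        for (size_t j = 0; j < N; j++) {
            b[i] += a[i][j] * x[j];
        }
    }
#else
    matvec_call(a, b, x);
#endif

    assert(b[0] == 6.0);   /* sanity check: row 0 is 0 + 1 + 2 + 3 */
    return b[N - 1];
}
```

Both builds compute the same values; only the information available to the compiler changes. An optimizer may also inline small functions by itself, so the measured effect of this change depends on the compiler and its options.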

To see if the NOFUNCCALL macro improves performance, do the following.

Rebuild the Target With the NOFUNCCALL Macro

In the terminal session, type make clean, then rebuild the target using the make nofunc command, which contains these compiler options:

  • -g -qopenmp

  • -D NOALIAS

  • -xHost

  • -D REDUCTION

  • -D ALIGNED

  • -D NOFUNCCALL

Re-run the Survey Analysis

  1. In the Vectorization Workflow pane, click the Run analysis control under Survey Target.

  2. Click the Snapshot icon and save a snapshot_nofunc result.

Assess Impact on Performance

  1. In the new Survey Report, notice the Elapsed time improves.

  2. Compare the new Summary to the baseline Summary. (This is easy to do if you took snapshots during each ReadMe step. From the File menu, choose Open > Result and choose the saved snapshot.)

Legal Information

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors which may cause deviations from published specifications. Current characterized errata are available on request.

Cilk, Intel, the Intel logo, Intel Atom, Intel Core, Intel Inside, Intel NetBurst, Intel SpeedStep, Intel vPro, Intel Xeon Phi, Intel XScale, Itanium, MMX, Pentium, Thunderbolt, Ultrabook, VTune and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© Intel Corporation