Intel VTune Profiler

Version Module File Name System A System B/C System G Cloud System Note
2024.0 intel-vtune/2024.0 + + - -
2023.2 (default) intel-vtune/2023.2 + + - -
2023.1 intel-vtune/2023.1 + + - -
2022.3 intel-vtune/2022.3 + + - -

+: Available for all users
- : Not available

In the Intel compiler available environment, execute the module command as follows.

$ module load intel-vtune

For details on the module command, please refer to Modules.

Commands Purpose
vtune-gui Acrivate the GUI version of the VTune Profiler.
vtune Activate the command line version of the VTune Profiler.

Options Purpose
-collect=string Specifies the analysis type.
-app-working-dir=string Specifies the working directory.
-r, -result-dir=string Specifies the directory in which to save the result.

Types Purpose
threading Displays multi-threaded parallelism.
hotspots Analyze hotspots.
memory-access Analyze memory access.

  1. Compiling
    When using the Intel VTune Profiler, compile with debugging options -g and the same optimization options as for the actual program execution.

    $ icc -g -O2 test.c
  2. Activating VTune Profiler
    In MobaXterm or FastX, the VTune Amplifier is activated when executing the vtune-gui command. Please refer to Interactive Processing for details on the tssrun command.

    $ tssrun --x11 vtune-gui 

  3. Creating projects
    Select New Project in the center of the screen and the following screen will appear. Then enter the appropriate Project name and click Create Project.

  4. Setting projects
    Next, the Configure Analysis screen will appear. Specify the target program in Application and click OK. If an arguments is required, specify it here.

  5. Start Analysis
    Click Performance Snapshot, specify the target of the check, and then press Start with the triangle symbol.

  6. Confirmation of analysis results
    After a short wait, the results of the analysis in the VTune Profiler are displayed. In this example, we can see that the multiply1 function is using a lot of CPU time.

The GUI version supports analysis of OpenMP programs.

When starting with the tssrun command, the --rsc option specifies the number of threads.

Example:8 threads execution

tssrun --x11 --rsc t=8:c=8 vtune-gui

Please use the CUI version for an analysis of MPI program.

  1. Compiling
    When using Intel VTune Profiler, compile with debugging options -g and the same optimization options as for the actual program execution.

    $ icc -g -O2 test.c
  2. Executing Check
    A CUI version of the VTune Profiler is available with the vtune command. In this example, we are analyzing hotspots by specifying the directory to output the results. Please refer to Interactive Processing for details on the tssrun command.

$ tssrun vtune -collect hotspots  -r=./result.vtune ./a.out
salloc: Pending job allocation 723318
salloc: job 723318 queued and waiting for resources
salloc: job 723318 has been allocated resources
salloc: Granted job allocation 723318
salloc: Waiting for resource configuration
salloc: Nodes xb0013 are ready for job
vtune: Analyzing data in the node-wide mode. The hostname (xb0013) will be added to the result path/name.
vtune: Collection started.

(Omitted)

vtune: Executing actions 42 % Saving the resultElapsed Time: 0.075s
 | Application execution time is too short. Metrics data may be unreliable.
 | Consider reducing the sampling interval or increasing your application
 | execution time.
 |
    CPU Time: 0.050s
        Effective Time: 0s
        Spin Time: 0s
            Imbalance or Serial Spinning: 0s
            Lock Contention: 0s
            Other: 0s
        Overhead Time: 0.050s
         | A significant portion of CPU time is spent in synchronization or
         | threading overhead. Consider increasing task granularity or the scope
         | of data synchronization.
         |
            Creation: 0s
            Scheduling: 0s
            Reduction: 0s
            Atomics: 0s
            Other: 0.050s
    Total Thread Count: 1
    Paused Time: 0s

Top Hotspots
Function                       Module       CPU Time  % of CPU Time(%)
-----------------------------  -----------  --------  ----------------
__kmp_api_omp_get_max_threads  libiomp5.so    0.050s            100.0%
Effective Physical Core Utilization: 100.0% (112.000 out of 112)
    Effective Logical Core Utilization: 148.8% (333.230 out of 224)
Collection and Platform Info
    Application Command Line: ../oss/openmp/omp-sample/01_hello/src/c/run.x
    Operating System: 4.18.0-477.15.1.el8_8.x86_64 Red Hat Enterprise Linux release 8.8 (Ootpa)
    Computer Name: xb0013
    Result Size: 5.5 MB
    Collection start time: 09:39:56 21/09/2023 UTC
    Collection stop time: 09:39:57 21/09/2023 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
        Frequency: 2.000 GHz
        Logical CPU Count: 224
        LLC size: 110.1 MB
        Cache Allocation Technology
            Level 2 capability: available
            Level 3 capability: available

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done
salloc: Relinquishing job allocation 723318

exit code: 0
  1. Checking the results
    Execute the vtune-gui command in the X (GUI) available environment, such as MobaXterm or FastX, and check the results with the GUI version of VTune Amplifier. Please refer to Interactive Processing for details on the tssrun command.

    $ tssrun --x11 vtune-gui ./result/result.vtune

The CUI version supports analysis of MPI and OpenMP programs. When executing VTune with tssrun, the --rsc option specifies the number of parallels.

Example: Analysis with MPI4 parallel

$ tssrun --rsc p=4 vtune -collect hotspots  -r=./result ./a.out
salloc: Pending job allocation 723080
salloc: job 723080 queued and waiting for resources
salloc: job 723080 has been allocated resources
salloc: Granted job allocation 723080
vtune: Analyzing data in the node-wide mode. The hostname (xb0127) will be added to the result path/name.
vtune: Collection started.

(Omitted)                             

vtune: ExecutinElapsed Time: 1.015sg the result
    CPU Time: 1.710s
        Effective Time: 1.710s
        Spin Time: 0s
            MPI Busy Wait Time: 0s
            Other: 0s
        Overhead Time: 0s
            Other: 0s
    Total Thread Count: 8
    Paused Time: 0s

Top Hotspots
Function                Module                CPU Time  % of CPU Time(%)
----------------------  --------------------  --------  ----------------
read                    libc.so.6               0.670s             39.2%
PMPI_Init               libmpi.so.12            0.112s              6.6%
main                    allrank                 0.100s              5.8%
dlopen                  libdl.so.2              0.089s              5.2%
[ld-linux-x86-64.so.2]  ld-linux-x86-64.so.2    0.080s              4.7%
[Others]                N/A                     0.658s             38.5%
Effective Physical Core Utilization: 89.5% (100.261 out of 112)
    Effective Logical Core Utilization: 45.0% (100.843 out of 224)
     | The metric value is low, which may signal a poor utilization of logical
     | CPU cores while the utilization of physical cores is acceptable. Consider
     | using logical cores, which in some cases can improve processor throughput
     | and overall performance of multi-threaded applications.
     |
Collection and Platform Info
    Application Command Line: ../lecture/20230906/mpi/allrank
    Operating System: 4.18.0-477.15.1.el8_8.x86_64 Red Hat Enterprise Linux release 8.8 (Ootpa)
    Computer Name: xb0127
    Result Size: 7.2 MB
    Collection start time: 09:24:37 21/09/2023 UTC
    Collection stop time: 09:24:38 21/09/2023 UTC
    Collector Type: Event-based counting driver,User-mode sampling and tracing
    CPU
        Name: Intel(R) Xeon(R) Processor code named Sapphirerapids
        Frequency: 2.000 GHz
        Logical CPU Count: 224
        LLC size: 110.1 MB
        Cache Allocation Technology
            Level 2 capability: available
            Level 3 capability: available

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done
salloc: Relinquishing job allocation 723080

exit code: 0