CUPTI Python Samples — CUPTI Python

CUPTI Python Samples#

Once CUPTI Python is installed, the CUPTI samples are located under the site-packages/cupti-python-samples directory. You can determine the location of your site-packages directory by executing the following command:

$ python3 -m site
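The same lookup can be done from Python. The following is a minimal sketch, not part of the shipped samples: it assumes the samples directory is named cupti-python-samples (as stated above) and sits directly under one of the interpreter's site-packages directories. The helper name find_samples_dirs is hypothetical.

```python
import site
from pathlib import Path

# Directory name taken from the text above.
SAMPLES_DIR_NAME = "cupti-python-samples"


def find_samples_dirs(search_roots=None):
    """Return every existing <root>/cupti-python-samples directory."""
    if search_roots is None:
        # Global site-packages directories plus the per-user one.
        search_roots = list(site.getsitepackages()) + [site.getusersitepackages()]
    found = []
    for root in search_roots:
        candidate = Path(root) / SAMPLES_DIR_NAME
        if candidate.is_dir():
            found.append(candidate)
    return found


if __name__ == "__main__":
    for path in find_samples_dirs():
        print(path)
```

An empty result means the directory was not found under the site-packages locations known to the running interpreter, for example because CUPTI Python was installed into a different environment.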

Setting up Numba CUDA#

The CUPTI Python Numba samples require the numba-cuda package along with the dependencies for CUDA 13.0. You can install numba-cuda using the following command:

$ pip install numba-cuda[cu13]
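To confirm that the installation succeeded before running the Numba samples, the installed distribution can be queried with the standard library. This is a small sketch; the helper name installed_version is hypothetical, and it only checks that the numba-cuda distribution is present, not that a working GPU or driver is available.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("numba-cuda")
    print("numba-cuda:", version if version else "not installed")
```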

Samples#

The CuptiVectorAdd* samples contain simple code that performs element-by-element vector addition.

CuptiVectorAddNumba.py#

CUPTI Python sample which shows use of CUPTI Activity APIs. This sample uses numba-cuda.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.
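For readers who want to mirror this interface in their own scripts, the options above can be expressed with argparse. This is an illustrative sketch only: it reproduces the documented flags and defaults, but it is not the parser used by the sample itself, and the function name build_parser is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the documented --profile and --output options."""
    parser = argparse.ArgumentParser(
        description="Vector addition sample (illustrative parser)"
    )
    # Boolean switch: absent means profiling is OFF.
    parser.add_argument(
        "--profile", "-p",
        action="store_true",
        help="Enable CUPTI based profiling. Default: OFF",
    )
    # Restricted to the three documented output types.
    parser.add_argument(
        "--output", "-o",
        metavar="OUTPUT_TYPE",
        choices=["brief", "detailed", "none"],
        default="brief",
        help="Select the profiler output format. Default: brief",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("profiling_enabled:", args.profile)
    print("prof_output:", args.output)
```

argparse adds --help, -h automatically, so that option does not need to be declared.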

CuptiVectorAddNumbaCallback.py#

CUPTI Python sample which shows use of CUPTI Callback APIs. This sample uses numba-cuda.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.

CuptiVectorAddDrv.py#

CUPTI Python sample which shows use of CUPTI Activity APIs. This sample uses CUDA Python Driver APIs from cuda-bindings. It also shows how to use CUDA profiler start and stop APIs to define the range of code to be profiled.

This sample uses NVRTC (NVIDIA Runtime Compilation) to compile CUDA kernel code to PTX at runtime. The sample demonstrates:

Using cuda.bindings.nvrtc to compile CUDA kernel source code to PTX
Using cuda.bindings.driver APIs to load the PTX module and launch kernels
Using CUPTI Activity APIs to profile the CUDA operations

To ensure that cuda-bindings is set up correctly along with the necessary CUDA Toolkit (CTK) components (including NVRTC), refer to the cuda-bindings runtime requirements documentation.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--define-profile-range, -r

Include CUDA profiler start and stop APIs to define the range of code to be profiled. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.

cupyprof.py#

CUPTI Python sample which shows how to profile a CUDA Python application using the CUPTI Python APIs without having to modify the CUDA Python application code. This sample shows use of CUPTI Activity APIs and Callback APIs. It also shows how to profile a range of code for a CUDA Python application which uses CUDA profiler start and stop APIs.

usage: cupyprof.py [-h] [-p {from_start|range}] [-a <activities>] [-o {brief|detailed|none}] <python_file_path> [args]

Command line options:

--help, -h

Shows the usage.

--profile, -p PROFILING_TYPE

Enable profiling for the entire CUDA Python program, or only for the section between cuProfilerStart and cuProfilerStop. PROFILING_TYPE can be: from_start or range. Default: from_start

--activity, -a <comma separated list of activities>

Use --help to view the list of supported activities. To see which activities are enabled by default, check default_activity_choices in cupyprof.py.

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

python_file_path is the path to the CUDA Python application, and args are the arguments passed to that Python application.
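One common way for a wrapper script to run another Python program unmodified, while forwarding its arguments, is the standard runpy module. The sketch below illustrates that general technique under the assumption that the target is an ordinary script; it is not taken from cupyprof.py, whose actual implementation may differ, and the function name run_unmodified is hypothetical.

```python
import runpy
import sys


def run_unmodified(script_path, script_args=()):
    """Run script_path as __main__ with its own argv, then restore sys.argv."""
    saved_argv = sys.argv
    # The target script sees itself as argv[0], followed by its own arguments.
    sys.argv = [script_path, *script_args]
    try:
        # A profiling wrapper would enable its profiler before this call
        # and flush collected records after it returns.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        sys.argv = saved_argv


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: wrapper.py <python_file_path> [args]")
    run_unmodified(sys.argv[1], sys.argv[2:])
```

Because the target runs with run_name set to "__main__", any `if __name__ == "__main__":` block in it executes as it would when launched directly.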

Examples of running samples#

Run the sample without profiling:

$ python3 CuptiVectorAddNumba.py

Run the sample with profiling enabled and use default output:

$ python3 CuptiVectorAddNumba.py --profile
profiling_enabled:  True
prof_output:  ProfOutput.BRIEF
vector_length:  1048576
threads_per_block:  128
blocks_per_grid:  8192
Activity Kind                  Start                Duration             correlationId        Name
DRIVER                         1714136661470990409  1834876              1                    cuCtxGetCurrent
DRIVER                         1714136661472854473  213                  2                    cuDeviceGetCount
DRIVER                         1714136661472869777  87                   3                    cuDeviceGet
DRIVER                         1714136661472880942  566                  4                    cuDeviceGetAttribute
DRIVER                         1714136661472883507  69                   5                    cuDeviceGetAttribute
DRIVER                         1714136661472906825  3702                 6                    cuDeviceGetName
DRIVER                         1714136661472969577  87                   7                    cuDeviceGetUuid_v2
DRIVER                         1714136661472991812  140587104            8                    cuDevicePrimaryCtxRetain
.
.
.
DRIVER                         1714136661714686225  218                  88                   cuCtxGetCurrent
DRIVER                         1714136661714688211  55                   89                   cuCtxGetDevice
DRIVER                         1714136661714702981  2080                 90                   cuCtxSynchronize
verify_result: PASS
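The launch configuration printed above is internally consistent: 1048576 elements at 128 threads per block need 8192 blocks. The sketch below shows the usual ceiling-division pattern for deriving the grid size. Whether the sample computes it exactly this way is an assumption, and the helper name blocks_for is hypothetical.

```python
def blocks_for(vector_length, threads_per_block):
    """Smallest number of blocks whose threads cover every element."""
    # Ceiling division using integers only.
    return (vector_length + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    # Values taken from the sample output above.
    print("blocks_per_grid:", blocks_for(1048576, 128))
```

Ceiling division matters when the vector length is not a multiple of the block size. In that case the last block is only partly used, and the kernel has to skip the threads whose index falls past the end of the vector.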

Use the cupyprof.py sample to profile a CUDA Python application with a profiling range defined and with detailed output:

$ python3 cupyprof.py --profile range --output detailed ./CuptiVectorAddDrv.py --define-profile-range
profiling_enabled: False
prof_output: ProfOutput.BRIEF
profile_range: True
vector_length: 1048576
threads_per_block: 128
blocks_per_grid: 8192
MEMCPY "HTOD" [ 1726060107808115285, 1726060107808868469 ] duration 753184, size 4194304, src_kind 1, dst_kind 3, correlation_id 2
        device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 10, channel_type ASYNC_MEMCPY
.
.
.
CONCURRENT_KERNEL [ 1737454707775744135, 1737454707775763143 ] duration 19008, "vector_add", correlation_id 5, cache_config_requested 0, cache_config_executed 0
    grid [8192, 1, 1], block [128, 1, 1], cluster [0, 0, 0], shared_memory (0, 0)
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 1, channel_type COMPUTE
.
.
.
MEMCPY "DTOH" [ 1737455038429091384, 1737455038429825494 ] duration 734110, size 4194304, src_kind 3, dst_kind 1, correlation_id 15
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 12, channel_type ASYNC_MEMCPY

verify_result: PASS
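In the detailed records above, the bracketed pair is the start and end timestamp, and the reported duration equals their difference. The sketch below checks this on two lines copied from that output and derives a copy throughput for the MEMCPY record. It assumes the timestamps are in nanoseconds, so bytes per nanosecond equals GB/s. The regular expressions are written against the line layout shown here and are not an official output format. All names are hypothetical.

```python
import re

# Two records copied from the detailed output above.
MEMCPY_HTOD = (
    'MEMCPY "HTOD" [ 1726060107808115285, 1726060107808868469 ] '
    "duration 753184, size 4194304, src_kind 1, dst_kind 3, correlation_id 2"
)
KERNEL = (
    "CONCURRENT_KERNEL [ 1737454707775744135, 1737454707775763143 ] "
    'duration 19008, "vector_add", correlation_id 5, '
    "cache_config_requested 0, cache_config_executed 0"
)

INTERVAL = re.compile(
    r"\[\s*(?P<start>\d+),\s*(?P<end>\d+)\s*\]\s*duration\s+(?P<duration>\d+)"
)
SIZE = re.compile(r"\bsize\s+(?P<size>\d+)")


def parse_interval(line):
    """Return (start, end, duration) as ints, or None if the line has no interval."""
    match = INTERVAL.search(line)
    if match is None:
        return None
    return int(match["start"]), int(match["end"]), int(match["duration"])


def memcpy_throughput(line):
    """Bytes per timestamp unit for a record with a size field, else None."""
    interval = parse_interval(line)
    size = SIZE.search(line)
    if interval is None or size is None or interval[2] == 0:
        return None
    return int(size["size"]) / interval[2]


if __name__ == "__main__":
    for record in (MEMCPY_HTOD, KERNEL):
        start, end, duration = parse_interval(record)
        print("end - start matches duration:", end - start == duration)
    # Assuming nanosecond timestamps, this value reads as GB/s.
    print("HTOD throughput:", memcpy_throughput(MEMCPY_HTOD))
```

The copy size of 4194304 bytes is four bytes for each of the 1048576 vector elements, which is consistent with 4-byte elements.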
