CUPTI Python Samples — CUPTI Python

CUPTI Python Samples#

Once CUPTI Python is installed, the CUPTI samples are located under the site-packages/cupti-python-samples directory. You can determine the location of your site-packages directory by executing the following command:

$ python3 -m site
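The same lookup can be done from Python. The following is a minimal sketch that searches the interpreter's known site-packages directories for the cupti-python-samples directory named above; the helper names candidate_site_packages and find_samples_dir are hypothetical and are not part of CUPTI Python.

```python
import site
import sysconfig
from pathlib import Path

# Directory name taken from the text above.
SAMPLES_DIR_NAME = "cupti-python-samples"


def candidate_site_packages():
    """Collect the site-packages directories known to this interpreter."""
    paths = []
    # These helpers can be missing in some older virtual environments.
    if hasattr(site, "getsitepackages"):
        paths.extend(site.getsitepackages())
    if hasattr(site, "getusersitepackages"):
        paths.append(site.getusersitepackages())
    paths.append(sysconfig.get_paths()["purelib"])
    # De-duplicate while preserving order.
    unique = []
    for p in paths:
        if p not in unique:
            unique.append(p)
    return [Path(p) for p in unique]


def find_samples_dir(roots=None):
    """Return the first existing samples directory, or None if not found."""
    if roots is None:
        roots = candidate_site_packages()
    for root in roots:
        candidate = Path(root) / SAMPLES_DIR_NAME
        if candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    print(find_samples_dir())
```

If the function returns None, the package may be installed in a different environment than the interpreter running the script.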

Setting up Numba CUDA#

The CUPTI Python Numba samples require the numba-cuda package along with the dependencies for CUDA 13.0. You can install numba-cuda using the following command:

$ pip install numba-cuda[cu13]

Samples#

The CuptiVectorAdd* samples contain simple code that performs element-wise vector addition.
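For readers unfamiliar with the pattern, here is a CPU-only sketch of element-wise vector addition that mimics how a GPU launch maps one thread to one element. It is not the samples' kernel code; the function names are hypothetical, and the sizes used in the demo (1048576 elements, 128 threads per block) are the ones printed in the sample output later on this page.

```python
def blocks_needed(vector_length, threads_per_block):
    """Smallest number of blocks whose threads cover every element."""
    return (vector_length + threads_per_block - 1) // threads_per_block


def vector_add_emulated(a, b, threads_per_block=128):
    """Add two equal-length vectors, one emulated 'thread' per element."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same length")
    n = len(a)
    c = [0] * n
    for block_idx in range(blocks_needed(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Global index, as a GPU thread would compute it.
            i = block_idx * threads_per_block + thread_idx
            # Bounds check: the last block may have idle threads.
            if i < n:
                c[i] = a[i] + b[i]
    return c


if __name__ == "__main__":
    print(blocks_needed(1048576, 128))  # 8192
    print(vector_add_emulated([1, 2, 3], [10, 20, 30], threads_per_block=2))  # [11, 22, 33]
```

The bounds check matters whenever the vector length is not a multiple of the block size, because the rounded-up grid then contains more threads than elements.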

CuptiVectorAddNumba.py#

CUPTI Python sample which shows use of CUPTI Activity APIs. This sample uses numba-cuda.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.
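As an illustration of how these options fit together, the sketch below declares them with argparse. It is an assumption about one reasonable way to express the documented flags and defaults, not the sample's actual parsing code; build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Declare the documented options: --profile/-p and --output/-o."""
    parser = argparse.ArgumentParser(
        description="Vector addition sample with optional CUPTI profiling"
    )
    parser.add_argument(
        "--profile", "-p",
        action="store_true",  # Default: OFF
        help="Enable CUPTI based profiling",
    )
    parser.add_argument(
        "--output", "-o",
        metavar="OUTPUT_TYPE",
        choices=["brief", "detailed", "none"],
        default="brief",
        help="Select the profiler output format",
    )
    return parser


if __name__ == "__main__":
    # Example: equivalent to passing "--profile -o detailed" on the command line.
    args = build_parser().parse_args(["--profile", "-o", "detailed"])
    print(args.profile, args.output)
```

argparse adds --help, -h automatically, which matches the third documented option.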

CuptiVectorAddNumbaCallback.py#

CUPTI Python sample which shows use of CUPTI Callback APIs. This sample uses numba-cuda.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.

CuptiVectorAddDrv.py#

CUPTI Python sample which shows use of CUPTI Activity APIs. This sample uses CUDA Python Driver APIs from cuda-bindings. It also shows how to use CUDA profiler start and stop APIs to define the range of code to be profiled.

This sample uses NVRTC (NVIDIA Runtime Compilation) to compile CUDA kernel code to PTX at runtime. The sample demonstrates:

  • Using cuda.bindings.nvrtc to compile CUDA kernel source code to PTX

  • Using cuda.bindings.driver APIs to load the PTX module and launch kernels

  • Using CUPTI Activity APIs to profile the CUDA operations

To ensure that cuda-bindings and the required CUDA Toolkit (CTK) components, including NVRTC, are set up correctly, refer to the cuda-bindings runtime requirements documentation.

Command line options:

--profile, -p

Enable CUPTI based profiling. Default: OFF

--define-profile-range, -r

Include CUDA profiler start and stop APIs to define the range of code to be profiled. Default: OFF

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

--help, -h

Shows the usage.

cupyprof.py#

CUPTI Python sample which shows how to profile a CUDA Python application using the CUPTI Python APIs without having to modify the CUDA Python application code. This sample shows use of CUPTI Activity APIs and Callback APIs. It also shows how to profile a range of code for a CUDA Python application which uses CUDA profiler start and stop APIs.

usage: cupyprof.py [-h] [-p {from_start|range}] [-a <activities>] [-o {brief|detailed|none}] <python_file_path> [args]

Command line options:

--help, -h

Shows the usage.

--profile, -p PROFILING_TYPE

Enable profiling for the entire CUDA Python program, or only for the region between cuProfilerStart and cuProfilerStop. PROFILING_TYPE can be from_start or range. Default: from_start

--activity, -a <comma separated list of activities>

Use --help to view the list of supported activities. To see which activities are enabled by default, check default_activity_choices in cupyprof.py.

--output, -o OUTPUT_TYPE

Select the profiler output format. OUTPUT_TYPE can be: brief, detailed, or none. Default: brief

python_file_path is the path to the CUDA Python application, and args are the arguments passed to that application.
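One general way for a wrapper to run another Python script without modifying it is the standard library's runpy module. The sketch below shows that technique only; it is an assumption and not necessarily how cupyprof.py is implemented. The function run_unmodified and its on_start and on_finish hooks are hypothetical placeholders for where profiler setup and teardown could go.

```python
import runpy
import sys


def run_unmodified(python_file_path, args=(), on_start=None, on_finish=None):
    """Run a script as __main__ with its own argv, then restore ours."""
    saved_argv = sys.argv
    # The target script sees itself as argv[0], followed by its own arguments.
    sys.argv = [python_file_path, *args]
    try:
        if on_start is not None:
            on_start()  # Hypothetical hook, e.g. enable profiling here.
        # Executes the file and returns its module globals as a dict.
        return runpy.run_path(python_file_path, run_name="__main__")
    finally:
        if on_finish is not None:
            on_finish()  # Hypothetical hook, e.g. flush and report here.
        sys.argv = saved_argv
```

Restoring sys.argv in a finally block keeps the wrapper's own state intact even if the target script raises an exception.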

Examples of running samples#

  1. Run the sample without profiling:

$ python3 CuptiVectorAddNumba.py
  2. Run the sample with profiling enabled and use default output:

$ python3 CuptiVectorAddNumba.py --profile
profiling_enabled: True
prof_output: ProfOutput.BRIEF
vector_length: 1048576
threads_per_block: 128
blocks_per_grid: 8192
Activity Kind  Start                Duration   correlationId  Name
DRIVER         1714136661470990409  1834876    1              cuCtxGetCurrent
DRIVER         1714136661472854473  213        2              cuDeviceGetCount
DRIVER         1714136661472869777  87         3              cuDeviceGet
DRIVER         1714136661472880942  566        4              cuDeviceGetAttribute
DRIVER         1714136661472883507  69         5              cuDeviceGetAttribute
DRIVER         1714136661472906825  3702       6              cuDeviceGetName
DRIVER         1714136661472969577  87         7              cuDeviceGetUuid_v2
DRIVER         1714136661472991812  140587104  8              cuDevicePrimaryCtxRetain
. . .
DRIVER         1714136661714686225  218        88             cuCtxGetCurrent
DRIVER         1714136661714688211  55         89             cuCtxGetDevice
DRIVER         1714136661714702981  2080       90             cuCtxSynchronize
verify_result: PASS
  3. Using the cupyprof.py sample to profile a CUDA Python application with profiling range defined and with detailed output:

$ python3 cupyprof.py --profile range --output detailed ./CuptiVectorAddDrv.py --define-profile-range
profiling_enabled: False
prof_output: ProfOutput.BRIEF
profile_range: True
vector_length: 1048576
threads_per_block: 128
blocks_per_grid: 8192
MEMCPY "HTOD" [ 1726060107808115285, 1726060107808868469 ] duration 753184, size 4194304, src_kind 1, dst_kind 3, correlation_id 2
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 10, channel_type ASYNC_MEMCPY
. . .
CONCURRENT_KERNEL [ 1737454707775744135, 1737454707775763143 ] duration 19008, "vector_add", correlation_id 5, cache_config_requested 0, cache_config_executed 0
    grid [8192, 1, 1], block [128, 1, 1], cluster [0, 0, 0], shared_memory (0, 0)
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 1, channel_type COMPUTE
. . .
MEMCPY "DTOH" [ 1737455038429091384, 1737455038429825494 ] duration 734110, size 4194304, src_kind 3, dst_kind 1, correlation_id 15
    device_id 0, context_id 1, stream_id 13, graph_id 0, graph_node_id 0, channel_id 12, channel_type ASYNC_MEMCPY
verify_result: PASS
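If you capture the brief output from the second example for further analysis, it can be parsed as whitespace-separated fields. The sketch below assumes each record line has exactly the five fields shown in that example (activity kind, start, duration, correlation ID, name) and skips every other line; the real format may differ between versions, and the names Record and parse_brief_line are hypothetical.

```python
from collections import namedtuple

# Field names mirror the header row of the brief output.
Record = namedtuple("Record", "kind start duration correlation_id name")


def parse_brief_line(line):
    """Return a Record for a data row, or None for any other line."""
    fields = line.split()
    if len(fields) != 5:
        return None  # Header, settings, ". . ." and result lines land here.
    kind, start, duration, correlation_id, name = fields
    if not (start.isdigit() and duration.isdigit() and correlation_id.isdigit()):
        return None
    return Record(kind, int(start), int(duration), int(correlation_id), name)


def total_duration(lines):
    """Sum the Duration column over all rows that parse as records."""
    records = (parse_brief_line(line) for line in lines)
    return sum(r.duration for r in records if r is not None)


if __name__ == "__main__":
    sample = [
        "Activity Kind Start Duration correlationId Name",
        "DRIVER 1714136661470990409 1834876 1 cuCtxGetCurrent",
        "DRIVER 1714136661472854473 213 2 cuDeviceGetCount",
        ". . .",
        "verify_result: PASS",
    ]
    print(total_duration(sample))  # 1835089
```

The total is in whatever unit the profiler reports for Duration, which is assumed here to be nanoseconds.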