HPCTrainingExamples/Affinity at main · amd/HPCTrainingExamples · GitHub
Affinity Exercises

In this set of exercises, we use two example applications side by side to show the effect of setting proper affinity to CPU cores and GPUs. With the hello_jobstep example, we see how affinity settings can be verified programmatically. With the Ghost Exchange example, we examine how setting affinity properly can improve performance. We work through the scenarios incrementally, from setting no affinity at all to setting both CPU core and GPU device affinity. Note that the hello_jobstep application is built with OpenMP® support, while the Ghost Exchange Ver1 example is not.

Build examples

Follow the instructions in the sections below for cloning and building the two examples.

Build hello_jobstep example

Clone the hello_jobstep code and build it as shown below:

    cd ~/git
    git clone https://code.ornl.gov/olcf/hello_jobstep.git
    cd hello_jobstep
    cat > Makefile.new << 'EOF'
    SOURCES = hello_jobstep.cpp
    OBJECTS = $(SOURCES:.cpp=.o)
    EXECUTABLE = hello_jobstep

    CXX=g++
    CXXFLAGS = -fopenmp -I/opt/ompi-5.0.3/include -I${ROCM_PATH}/include -D__HIP_PLATFORM_AMD__
    LDFLAGS = -L${ROCM_PATH}/lib -lhsa-runtime64 -lamdhip64 -L/opt/ompi-5.0.3/lib -lmpi

    all: ${EXECUTABLE}

    %.o: %.cpp
    	$(CXX) $(CXXFLAGS) -o $@ -c $<

    $(EXECUTABLE): $(OBJECTS)
    	$(CXX) $(CXXFLAGS) $(OBJECTS) -o $@ $(LDFLAGS)

    clean:
    	rm -f $(EXECUTABLE)
    	rm -f $(OBJECTS)
    EOF
    make -f Makefile.new

Build Ghost Exchange (Ver1) example

To build the Ghost Exchange code, ROCm must be loaded in your environment and the ROCM_PATH environment variable must be set to the path of your ROCm installation. We also need a build of OpenMPI (say ver 5.0.3) that uses UCX (say ver 1.16.x) built with ROCm. The following instructions assume that ROCM_PATH is set up, and that OpenMPI is installed at /opt/ompi-5.0.3.

Ver1 of the Ghost Exchange example offloads work to the GPU using the HIP programming model and a managed memory model. With managed memory, the buffers are still initially allocated on the host, but the OS manages page migration and data movement across the PCIe link between the host and device. For this to work when the Ghost Exchange example runs, we need to set the environment variable HSA_XNACK=1.

    cd ~/git
    git clone git@github.com:amd/HPCTrainingExamples.git
    cd HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1
    export PATH=/opt/ompi-5.0.3/bin:$PATH
    export LD_LIBRARY_PATH=/opt/rocm/lib:/opt/ompi-5.0.3/lib:$LD_LIBRARY_PATH
    export HSA_XNACK=1
    mkdir build; cd build;
    cmake -D CMAKE_CXX_COMPILER=${ROCM_PATH}/bin/amdclang++ -D CMAKE_C_COMPILER=${ROCM_PATH}/bin/amdclang -DCMAKE_PREFIX_PATH=${ROCM_PATH}/lib/cmake/hip ..
    make
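Before building and running, it can help to confirm that the prerequisites described above are in place. The following is a minimal sketch and not part of the repository: the function name `check_build_env` is hypothetical, and it only checks that `ROCM_PATH` is set, that `HSA_XNACK` is 1, and that `mpirun` and `cmake` can be found on `PATH`.

```shell
#!/bin/bash
# Hypothetical pre-build check (not part of the repository).
# Prints one line per missing prerequisite; returns non-zero if any is missing.
check_build_env() {
    local problems=0 tool
    if [ -z "${ROCM_PATH:-}" ]; then
        echo "ROCM_PATH is not set"
        problems=$((problems + 1))
    fi
    if [ "${HSA_XNACK:-}" != "1" ]; then
        echo "HSA_XNACK is not set to 1"
        problems=$((problems + 1))
    fi
    for tool in mpirun cmake; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found on PATH"
            problems=$((problems + 1))
        fi
    done
    [ "$problems" -eq 0 ]
}
```

Calling `check_build_env && echo "environment looks ready"` after the `export` lines above would print either the confirmation or the list of missing items.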

System topology

To understand how to set up affinity, we need to know the topology of the system we have. Two commands help us here. lscpu reports the number of CPU hardware threads (HWTs) and how they are grouped into NUMA domains. rocm-smi --showtoponuma shows the NUMA configuration for the GPU devices on the system. As an example, here are the details from each command on the node we are using for this tutorial.

The output of lscpu below shows that hardware threads 0-23 and 96-119 belong to NUMA domain 0, 24-47 and 120-143 belong to NUMA domain 1, and so on.

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                192
    On-line CPU(s) list:   0-191
    Thread(s) per core:    2
    Core(s) per socket:    24
    Socket(s):             4
    NUMA node(s):          4
    Vendor ID:             AuthenticAMD
    CPU family:            25
    Model:                 144
    Model name:            AMD Instinct MI300A Accelerator
    Stepping:              1
    CPU MHz:               3700.000
    CPU max MHz:           3700.0000
    CPU min MHz:           1500.0000
    BogoMIPS:              7399.66
    Virtualization:        AMD-V
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              1024K
    L3 cache:              32768K
    NUMA node0 CPU(s):     0-23,96-119
    NUMA node1 CPU(s):     24-47,120-143
    NUMA node2 CPU(s):     48-71,144-167
    NUMA node3 CPU(s):     72-95,168-191

The output of rocm-smi --showtoponuma shows that GPU device 0 belongs to NUMA domain 0, GPU 1 belongs to NUMA domain 1, and so on. With this knowledge, we expect the best placement for a process using GPU 0 to be a hardware thread in the range 0-23,96-119. The proximity of CPU cores to GPUs is determined by their association with a common NUMA domain.

    ============================ ROCm System Management Interface ============================
    ======================================= Numa Nodes =======================================
    GPU[0] : (Topology) Numa Node: 0
    GPU[0] : (Topology) Numa Affinity: 0
    GPU[1] : (Topology) Numa Node: 1
    GPU[1] : (Topology) Numa Affinity: 1
    GPU[2] : (Topology) Numa Node: 2
    GPU[2] : (Topology) Numa Affinity: 2
    GPU[3] : (Topology) Numa Node: 3
    GPU[3] : (Topology) Numa Affinity: 3
    ================================== End of ROCm SMI Log ===================================
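The lookup from a NUMA domain to its hardware threads can be scripted rather than read by eye. The sketch below is an illustration only: the helper name `numa_cpus` is hypothetical, and it assumes lscpu prints lines of the form `NUMA nodeN CPU(s): <list>` as in the output above (leading whitespace is tolerated).

```shell
#!/bin/bash
# Hypothetical helper: print the CPU list of one NUMA node.
# Reads lscpu-style text on stdin; the node number is the first argument.
# Usage: lscpu | numa_cpus 0
numa_cpus() {
    awk -v key="NUMA node$1 CPU(s):" '
    {
        line = $0
        sub(/^[ \t]+/, "", line)              # drop any leading indentation
        if (index(line, key) == 1) {          # line starts with the wanted key
            sub(/^[^:]*:[ \t]*/, "", line)    # keep only the value after the colon
            print line
        }
    }'
}
```

On the tutorial node, `lscpu | numa_cpus 0` would print `0-23,96-119`, the range we want for a process that uses GPU 0. The result is in the list format that `taskset -c` accepts, so it could also be passed to that tool for quick manual pinning.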

Run 1 Rank / GPU

Let us examine the case where we run only 1 MPI process per GPU device. In each case shown below, we will run both examples using the same settings and observe the output from each one.

Run without affinity settings

In this case, we do not set up any affinity, so each process lands on an arbitrary HWT; in the run below these are HWTs 001, 121, 144, and 072. All four GPUs are visible to every process, as the RT_GPU_ID list shows, and no GPU is assigned to any process, because GPU_ID is N/A.

    $ cd ~/git/hello_jobstep
    $ OMP_NUM_THREADS=1 mpirun -np 4 --mca pml ucx --mca coll ^hcoll ./hello_jobstep
    MPI 002 - OMP 000 - HWT 144 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02
    MPI 001 - OMP 000 - HWT 121 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02
    MPI 003 - OMP 000 - HWT 072 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02
    MPI 000 - OMP 000 - HWT 001 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02

The output for the Ghost Exchange example shows the same pattern: every rank sees all four GPUs and none is assigned one. The elapsed time for this run without any affinity settings is 3.55 seconds.

    $ cd ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1/build
    $ mpirun -np 4 --mca pml ucx --mca coll ^hcoll ./GhostExchange -x 2 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 100
    MPI 003 - HWT 072 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A
    MPI 002 - HWT 145 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A
    MPI 000 - HWT 003 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A
    MPI 001 - HWT 024 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A

    GhostExchange_ArrayAssign_HIP Timing is stencil 2.443530 boundary condition 0.022016 ghost cell 0.032631 total 3.552828

Run with GPU affinity only

Next, we will set up GPU affinity by assigning a unique GPU device to each process via the environment variable ROCR_VISIBLE_DEVICES. We use a script at ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/set_gpu_device_mi300a.sh to achieve this. Now, we see that process 0 ran on GPU device 0 (see GPU_ID field), and so on.

    $ cd ~/git/hello_jobstep
    $ OMP_NUM_THREADS=1 mpirun -np 4 --mca pml ucx --mca coll ^hcoll ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/set_gpu_device_mi300a.sh ./hello_jobstep
    MPI 003 - OMP 000 - HWT 081 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02
    MPI 002 - OMP 000 - HWT 145 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 000 - OMP 000 - HWT 096 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 001 - OMP 000 - HWT 025 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
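The contents of set_gpu_device_mi300a.sh are not reproduced in this README. As a rough illustration of the idea only, and not the actual script, a wrapper of this kind can derive a GPU index from the node-local rank that OpenMPI exposes in `OMPI_COMM_WORLD_LOCAL_RANK`, export `ROCR_VISIBLE_DEVICES`, and then launch the application. The function names `pick_gpu` and `run_with_gpu`, the round-robin mapping, and the default of 4 GPUs are assumptions made for this sketch.

```shell
#!/bin/bash
# Illustrative sketch only; NOT the contents of set_gpu_device_mi300a.sh.

# Map a node-local rank to a GPU index, round-robin over num_gpus devices.
pick_gpu() {
    local local_rank=$1 num_gpus=$2
    echo $(( local_rank % num_gpus ))
}

# Export the chosen device, then replace this shell with the wrapped command.
# OMPI_COMM_WORLD_LOCAL_RANK is set by OpenMPI's mpirun for every process.
run_with_gpu() {
    ROCR_VISIBLE_DEVICES=$(pick_gpu "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" "${NUM_GPUS:-4}")
    export ROCR_VISIBLE_DEVICES
    exec "$@"
}

# When saved as a script and given a command, wrap that command.
if [ "$#" -gt 0 ]; then
    run_with_gpu "$@"
fi
```

Saved under a hypothetical name such as `my_gpu_wrapper.sh`, it would be used as `mpirun -np 4 ./my_gpu_wrapper.sh ./hello_jobstep`. Because each process then sees a single device, the runtime numbers that device 0, which is consistent with every rank reporting RT_GPU_ID 0 in the output above.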

We can observe that GPU affinity is set up correctly in the Ghost Exchange example too. In this case, the application ran almost 3x faster, with an elapsed time of 1.2 seconds compared to 3.55 seconds without affinity.

    $ cd ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1/build
    $ mpirun -np 4 --mca pml ucx --mca coll ^hcoll ../../set_gpu_device_mi300a.sh ./GhostExchange -x 2 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 100

    MPI 000 - HWT 097 - RT_GPU_ID 0 - GPU_ID 0
    MPI 003 - HWT 089 - RT_GPU_ID 0 - GPU_ID 3
    MPI 001 - HWT 024 - RT_GPU_ID 0 - GPU_ID 1
    MPI 002 - HWT 048 - RT_GPU_ID 0 - GPU_ID 2

    GhostExchange_ArrayAssign_HIP Timing is stencil 0.629131 boundary condition 0.007776 ghost cell 0.074072 total 1.208454

Run with CPU affinity only

Now, we will attempt to set CPU affinity only.

For the hello_jobstep example, which supports OpenMP, we use the OpenMP environment variables OMP_PLACES and OMP_PROC_BIND to specify placement and binding for each process. With OMP_PLACES=threads and OMP_PROC_BIND=close, processes 0, 1, 2, and 3 ran on HWTs 0, 24, 48, and 72 respectively. You may see different behavior on your system, in which case consult the OpenMP documentation for the settings these environment variables accept.

    $ cd ~/git/hello_jobstep
    $ OMP_NUM_THREADS=1 OMP_PLACES=threads OMP_PROC_BIND=close mpirun -np 4 --mca pml ucx --mca coll ^hcoll ./hello_jobstep
    MPI 002 - OMP 000 - HWT 048 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02
    MPI 003 - OMP 000 - HWT 072 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02
    MPI 001 - OMP 000 - HWT 024 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02
    MPI 000 - OMP 000 - HWT 000 - Node <host> - RT_GPU_ID 0,1,2,3 - GPU_ID N/A - Bus_ID 02,02,02,02

The Ghost Exchange example is not built with OpenMP support, so we need a different mechanism to pin its processes. We use OpenMPI binding options as shown below. The total time of 3.91 seconds is no better than the 3.55 seconds measured without any affinity, so setting CPU affinity alone is not sufficient.

    $ cd ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1/build
    $ mpirun -np 4 --mca pml ucx --mca coll ^hcoll --map-by ppr:1:socket ./GhostExchange -x 2 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 100
    MPI 000 - HWT 000 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A
    MPI 001 - HWT 024 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A
    MPI 003 - HWT 072 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A
    MPI 002 - HWT 048 - RT_GPU_ID 0,1,2,3 - GPU_ID N/A

    GhostExchange_ArrayAssign_HIP Timing is stencil 2.803742 boundary condition 0.016902 ghost cell 0.052982 total 3.907620
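The option `--map-by ppr:1:socket` asks OpenMPI to place one process per socket. On a node like this one, with four sockets and one GPU per NUMA domain, the count is the number of ranks divided by the number of sockets. A minimal sketch of that arithmetic follows; the helper name `ppr_option` is hypothetical and it assumes the ranks divide evenly across sockets.

```shell
#!/bin/bash
# Hypothetical helper: build the --map-by option for an even spread of
# ranks over sockets. Fails if the ranks do not divide evenly.
ppr_option() {
    local ranks=$1 sockets=$2
    if [ $(( ranks % sockets )) -ne 0 ]; then
        echo "ranks ($ranks) must be a multiple of sockets ($sockets)" >&2
        return 1
    fi
    echo "--map-by ppr:$(( ranks / sockets )):socket"
}
```

For example, `mpirun -np 4 $(ppr_option 4 4) ./GhostExchange ...` expands to the command used above. Adding OpenMPI's `--report-bindings` option to the mpirun command line is one way to see where each rank was bound.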

Run with CPU and GPU affinity

Let us set both CPU and GPU affinity now.

    $ cd ~/git/hello_jobstep
    $ OMP_NUM_THREADS=1 OMP_PLACES=threads OMP_PROC_BIND=close mpirun -np 4 --mca pml ucx --mca coll ^hcoll ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/set_gpu_device_mi300a.sh ./hello_jobstep
    MPI 002 - OMP 000 - HWT 048 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 000 - OMP 000 - HWT 000 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 001 - OMP 000 - HWT 024 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 003 - OMP 000 - HWT 072 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02
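Output like this can be checked mechanically instead of by eye. The sketch below reads hello_jobstep or GhostExchange output lines and reports whether each rank's HWT lies in the NUMA domain of its GPU. It hard-codes the layout of the tutorial node as an assumption (four NUMA domains of 24 cores each, SMT siblings offset by 96, GPU n in NUMA domain n), and the function name `check_affinity` is hypothetical.

```shell
#!/bin/bash
# Hypothetical checker for the tutorial node's layout only:
# NUMA domain of a hardware thread = (HWT mod 96) / 24, GPU n sits in NUMA n.
check_affinity() {
    awk '
    {
        hwt = ""; gpu = ""
        for (i = 1; i < NF; i++) {
            if ($i == "HWT")    hwt = $(i + 1)
            if ($i == "GPU_ID") gpu = $(i + 1)
        }
        # Skip lines without both fields, or with no GPU assigned.
        if (hwt == "" || gpu == "" || gpu == "N/A") next
        numa = int((hwt % 96) / 24)
        verdict = (numa == gpu + 0) ? "ok" : "MISMATCH"
        printf "rank %s: HWT %d is in NUMA %d, GPU %d -> %s\n", $2, hwt, numa, gpu, verdict
    }'
}
```

Appending `| check_affinity` to the mpirun command above would report `ok` for all four ranks in this run, for example HWT 048 falls in NUMA domain 2 and that rank uses GPU 2.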

For Ghost Exchange, we combine mpirun binding options with the GPU affinity script. The total time returns to roughly what we saw when we pinned the GPU device only (1.24 seconds versus 1.21 seconds), which shows that GPU affinity accounts for most of the performance gain for this application.

    $ cd ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1/build
    $ mpirun -np 4 --mca pml ucx --mca coll ^hcoll --map-by ppr:1:socket ../../set_gpu_device_mi300a.sh ./GhostExchange -x 2 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 100
    MPI 000 - HWT 096 - RT_GPU_ID 0 - GPU_ID 0
    MPI 002 - HWT 049 - RT_GPU_ID 0 - GPU_ID 2
    MPI 001 - HWT 121 - RT_GPU_ID 0 - GPU_ID 1
    MPI 003 - HWT 083 - RT_GPU_ID 0 - GPU_ID 3

    GhostExchange_ArrayAssign_HIP Timing is stencil 0.651226 boundary condition 0.007590 ghost cell 0.080684 total 1.235013

Oversubscribing processes on GPU devices

Oversubscribing a GPU with multiple processes requires a more complex script that sets up affinity such that neighboring ranks are closely packed on GPU devices (i.e., ranks 0 and 1 on GPU 0, ranks 2 and 3 on GPU 1, etc.) and use different cores on the NUMA domain that is closest to the selected GPU. The script shown below works for the hello_jobstep example because it sets CPU affinity through the GOMP_CPU_AFFINITY environment variable, which only takes effect in an OpenMP-based application.

    #!/bin/bash
    # Rank information provided by OpenMPI's mpirun
    export global_rank=${OMPI_COMM_WORLD_RANK}
    export local_rank=${OMPI_COMM_WORLD_LOCAL_RANK}
    export ranks_per_node=${OMPI_COMM_WORLD_LOCAL_SIZE}

    # Defaults; each can be overridden from the environment
    if [ -z "${NUM_CPUS}" ]; then
        let NUM_CPUS=96
    fi
    if [ -z "${RANK_STRIDE}" ]; then
        let RANK_STRIDE=$(( ${NUM_CPUS}/${ranks_per_node} ))
    fi
    if [ -z "${OMP_STRIDE}" ]; then
        let OMP_STRIDE=1
    fi
    if [ -z "${NUM_GPUS}" ]; then
        let NUM_GPUS=4
    fi
    if [ -z "${GPU_START}" ]; then
        let GPU_START=0
    fi
    if [ -z "${GPU_STRIDE}" ]; then
        let GPU_STRIDE=1
    fi

    # First and last CPU for this rank
    cpu_list=($(seq 0 95))
    let cpus_per_gpu=${NUM_CPUS}/${NUM_GPUS}
    let cpu_start_index=$(( ($RANK_STRIDE*${local_rank})+${GPU_START}*$cpus_per_gpu ))
    let cpu_start=${cpu_list[$cpu_start_index]}
    let cpu_stop=$(($cpu_start+$OMP_NUM_THREADS*$OMP_STRIDE-1))

    # GPU for this rank: neighboring ranks share a device
    gpu_list=(0 1 2 3)
    let ranks_per_gpu=$(((${ranks_per_node}+${NUM_GPUS}-1)/${NUM_GPUS}))
    let my_gpu_index=$(($local_rank*$GPU_STRIDE/$ranks_per_gpu))+${GPU_START}
    let my_gpu=${gpu_list[${my_gpu_index}]}

    export GOMP_CPU_AFFINITY=$cpu_start-$cpu_stop:$OMP_STRIDE
    export ROCR_VISIBLE_DEVICES=$my_gpu

    # Run the wrapped command
    "$@"
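To see what the script computes, its index arithmetic can be reproduced outside of mpirun. The sketch below is a worked example under the script's default settings (NUM_CPUS=96, NUM_GPUS=4, GPU_START=0, GPU_STRIDE=1); the function name `placement` is hypothetical, and it takes the local rank and ranks per node as arguments instead of reading the OpenMPI environment variables.

```shell
#!/bin/bash
# Hypothetical worked example of the packing script's arithmetic with its
# defaults: NUM_CPUS=96, NUM_GPUS=4, GPU_START=0, GPU_STRIDE=1.
placement() {
    local local_rank=$1 ranks_per_node=$2
    local num_cpus=96 num_gpus=4
    # Spread ranks evenly over the 96 cores.
    local rank_stride=$(( num_cpus / ranks_per_node ))
    # Ceiling division: how many ranks share one GPU.
    local ranks_per_gpu=$(( (ranks_per_node + num_gpus - 1) / num_gpus ))
    echo "rank $local_rank: first CPU $(( rank_stride * local_rank )), GPU $(( local_rank / ranks_per_gpu ))"
}

# Print the placement of all 8 ranks in a 2 ranks/GPU run.
for r in 0 1 2 3 4 5 6 7; do
    placement "$r" 8
done
```

With 8 ranks the stride is 12 cores, so rank 1 starts at CPU 12 on GPU 0 and rank 2 starts at CPU 24 on GPU 1, which keeps every rank on the NUMA domain of its GPU.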

Run 2 ranks/GPU with CPU and GPU affinity

    $ cd ~/git/hello_jobstep
    $ OMP_NUM_THREADS=1 mpirun -np 8 --mca pml ucx --mca coll ^hcoll ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/set_cpu_gpu_mi300a.sh ./hello_jobstep | sort
    MPI 000 - OMP 000 - HWT 000 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 001 - OMP 000 - HWT 012 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 002 - OMP 000 - HWT 024 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 003 - OMP 000 - HWT 036 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 004 - OMP 000 - HWT 048 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 005 - OMP 000 - HWT 060 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 006 - OMP 000 - HWT 072 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02
    MPI 007 - OMP 000 - HWT 084 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02

For Ghost Exchange, we combine OpenMPI binding options, which place the processes on CPUs, with the packing script, which sets GPU affinity:

    $ cd ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1/build
    $ OMP_NUM_THREADS=1 mpirun -np 8 --mca pml ucx --mca coll ^hcoll --map-by ppr:2:socket ../../set_cpu_gpu_mi300a.sh ./GhostExchange -x 4 -y 2 -i 20000 -j 20000 -h 2 -t -c -I 100
    MPI 003 - HWT 136 - RT_GPU_ID 0 - GPU_ID 1
    MPI 004 - HWT 145 - RT_GPU_ID 0 - GPU_ID 2
    MPI 005 - HWT 159 - RT_GPU_ID 0 - GPU_ID 2
    MPI 002 - HWT 122 - RT_GPU_ID 0 - GPU_ID 1
    MPI 006 - HWT 080 - RT_GPU_ID 0 - GPU_ID 3
    MPI 007 - HWT 168 - RT_GPU_ID 0 - GPU_ID 3
    MPI 001 - HWT 001 - RT_GPU_ID 0 - GPU_ID 0
    MPI 000 - HWT 016 - RT_GPU_ID 0 - GPU_ID 0

    GhostExchange_ArrayAssign_HIP Timing is stencil 0.467487 boundary condition 0.005841 ghost cell 0.212834 total 1.195145

Run 4 ranks/GPU with CPU and GPU affinity

For hello_jobstep:

    $ cd ~/git/hello_jobstep
    $ OMP_NUM_THREADS=1 mpirun -np 16 --mca pml ucx --mca coll ^hcoll ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/set_cpu_gpu_mi300a.sh ~/git/hello_jobstep/hello_jobstep | sort
    MPI 000 - OMP 000 - HWT 000 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 001 - OMP 000 - HWT 006 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 002 - OMP 000 - HWT 012 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 003 - OMP 000 - HWT 018 - Node <host> - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID 02
    MPI 004 - OMP 000 - HWT 024 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 005 - OMP 000 - HWT 030 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 006 - OMP 000 - HWT 036 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 007 - OMP 000 - HWT 042 - Node <host> - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID 02
    MPI 008 - OMP 000 - HWT 048 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 009 - OMP 000 - HWT 054 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 010 - OMP 000 - HWT 060 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 011 - OMP 000 - HWT 066 - Node <host> - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID 02
    MPI 012 - OMP 000 - HWT 072 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02
    MPI 013 - OMP 000 - HWT 078 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02
    MPI 014 - OMP 000 - HWT 084 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02
    MPI 015 - OMP 000 - HWT 090 - Node <host> - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 02

And for Ghost Exchange:

    $ cd ~/git/HPCTrainingExamples/MPI-examples/GhostExchange/GhostExchange_ArrayAssign/HIP/Ver1/build
    $ OMP_NUM_THREADS=1 mpirun -np 16 --mca pml ucx --mca coll ^hcoll --map-by ppr:4:socket ../../set_cpu_gpu_mi300a.sh ./GhostExchange -x 4 -y 4 -i 20000 -j 20000 -h 2 -t -c -I 100
    MPI 009 - HWT 057 - RT_GPU_ID 0 - GPU_ID 2
    MPI 010 - HWT 048 - RT_GPU_ID 0 - GPU_ID 2
    MPI 001 - HWT 010 - RT_GPU_ID 0 - GPU_ID 0
    MPI 002 - HWT 011 - RT_GPU_ID 0 - GPU_ID 0
    MPI 012 - HWT 184 - RT_GPU_ID 0 - GPU_ID 3
    MPI 008 - HWT 071 - RT_GPU_ID 0 - GPU_ID 2
    MPI 011 - HWT 068 - RT_GPU_ID 0 - GPU_ID 2
    MPI 013 - HWT 074 - RT_GPU_ID 0 - GPU_ID 3
    MPI 014 - HWT 177 - RT_GPU_ID 0 - GPU_ID 3
    MPI 015 - HWT 168 - RT_GPU_ID 0 - GPU_ID 3
    MPI 000 - HWT 023 - RT_GPU_ID 0 - GPU_ID 0
    MPI 004 - HWT 126 - RT_GPU_ID 0 - GPU_ID 1
    MPI 007 - HWT 025 - RT_GPU_ID 0 - GPU_ID 1
    MPI 003 - HWT 099 - RT_GPU_ID 0 - GPU_ID 0
    MPI 006 - HWT 041 - RT_GPU_ID 0 - GPU_ID 1
    MPI 005 - HWT 038 - RT_GPU_ID 0 - GPU_ID 1

    GhostExchange_ArrayAssign_HIP Timing is stencil 0.561575 boundary condition 0.028996 ghost cell 0.059794 total 1.095871
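To compare the Ghost Exchange runs at a glance, the total times reported in this tutorial can be turned into speedups relative to the run without affinity. The sketch below only does that arithmetic on the numbers printed above; the function name `speedup` is hypothetical, and since each configuration was timed once, small differences between rows should not be over-interpreted.

```shell
#!/bin/bash
# Hypothetical helper: ratio of a baseline time to another time, 2 decimals.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2f\n", base / t }'
}

# Total times (seconds) copied from the Ghost Exchange runs in this tutorial.
baseline=3.552828
while read -r total label; do
    printf '%-40s %9s s  %sx\n' "$label" "$total" "$(speedup "$baseline" "$total")"
done <<'EOF'
3.552828 no affinity, 1 rank/GPU
1.208454 GPU affinity only, 1 rank/GPU
3.907620 CPU affinity only, 1 rank/GPU
1.235013 CPU and GPU affinity, 1 rank/GPU
1.195145 CPU and GPU affinity, 2 ranks/GPU
1.095871 CPU and GPU affinity, 4 ranks/GPU
EOF
```

The GPU-affinity-only row works out to 2.94x, the "almost 3x" quoted earlier, while the CPU-affinity-only row comes out below 1x.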

Hope this tutorial helped you learn some useful tricks to set affinity properly. Happy pinning!
