In Part 1 of the Affinity blog series, we looked at the importance of setting affinity for High Performance Computing (HPC) workloads. In this blog post, our goals are the following:
- Point to tools that can help you understand the topology of your system
- Discuss ways to verify if affinity is set up correctly for your run
- Show you how to set affinity for different types of applications
Please note that the tools and techniques provided here may not work on your system, for example because of a different Message Passing Interface (MPI) implementation, kernel version, or system configuration. These notes serve as a reference; we expect you to take away the ideas and apply them to your own system setup.
Understanding system topology
On heterogeneous systems, we have different types of resources such as CPUs, GPUs, memory controllers, and NICs. There are many tools that you can use to progressively understand the Non-Uniform Memory Access (NUMA) configuration of your system. In this section, we show examples of these tools and the information they can provide.
CPU information – lscpu
The lscpu tool, part of the Linux® distribution, can be used to display information about the CPU architecture in an easy-to-read text format. A snippet of the lscpu output is shown below. Some key items to note are the number of sockets on the system, the number of physical cores per socket, the number of hardware threads (HWTs) on each physical core, and the NUMA domains configured on the system. Alongside each NUMA domain is a list of the physical cores and HWTs that belong to that domain. In this case, there are 2 sockets and 8 NUMA domains, which indicates an NPS4 configuration. The first 4 NUMA domains are in socket 0 and the next 4 are in socket 1.
CPU NUMA configuration – numactl
The numactl tool, part of the Linux distribution, can be used to control NUMA policy for processes or shared memory. It can also be used to display information about which CPU cores or HWTs belong to each NUMA domain. A sample of numactl -H output is shown below. Here, observe that distances between NUMA domains in different sockets are larger.
GPU information – rocm-smi
The rocm-smi tool is distributed as part of the ROCm™ stack and can be used to display details about the GPUs on the system. The rocm-smi output below shows that there are 8 AMD Instinct™ MI210 GPUs on the system.
GPU NUMA configuration – rocm-smi --showtoponuma
rocm-smi --showtoponuma helps you understand the NUMA binding of each GPU. For instance, here we see that GPU 0 is closest to NUMA node 3, and from the lscpu or numactl output we know that NUMA node 3 contains CPU cores/HWTs 48-63,176-191. So, on this system, it is good to run the process that uses GPU 0 on, say, core 48.
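Putting these observations together, one way to launch such a process is to combine GPU visibility with CPU pinning. This is a sketch: `./app` is a placeholder for your executable, and it assumes numactl and ROCm are installed.

```shell
# Expose only GPU 0 to the HIP runtime, and pin the process to core 48,
# which lies in NUMA node 3, the domain closest to GPU 0 on this system.
HIP_VISIBLE_DEVICES=0 numactl --physcpubind=48 ./app
```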
Node topology – lstopo
lstopo, part of the hwloc package on Linux, can be used to show the topology of the system in various output formats. In the figure below, we see the two packages representing the two sockets on the node. Each package has 4 NUMA domains.
Zooming in to NUMA node 1 as shown in the figure below, we see that it has 16 CPU cores, 2 HWTs per core, and one GPU connected via the PCIe interface. Observe that 8 physical cores share an L3 cache. Placing threads of the same process close together in the same L3 cache region would improve cache reuse if the data being read by all those threads fits in cache.
Verifying affinity setup
It is always a good idea to check if you have the correct affinity set up before you measure performance. Below is a list of some ways you could do that.
- A look at top or htop can tell us the CPU HWTs on which processes and their threads are running
- If using OpenMPI, mpirun --report-bindings can be used to show the selection of cores where each rank may be placed
- For MPI + OpenMP® programs, you can use the following simple “Hello, World” program from Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) to check mappings: hello_mpi_omp
- For MPI + OpenMP + HIP programs, a simple “Hello, World” program with HIP from OLCF can be used to verify mappings: hello_jobstep
- Example code from Chapter 14 of Bob Robey’s book, Parallel and High Performance Computing, can be used to verify mappings for OpenMP, MPI, and MPI+OpenMP cases
When running jobs with the Slurm batch command, it is good practice to run one of the above “Hello, World” programs just before your job, with the same Slurm configuration, to verify affinity settings.
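For instance, if the real job will be launched with 8 tasks and 2 cores per task (hypothetical numbers, adjust to match your job), the check might look like:

```shell
# Run the verifier with the exact same Slurm geometry as the job, then
# the job itself; hello_mpi_omp prints the rank/thread/core mapping.
srun -N 1 -n 8 -c 2 ./hello_mpi_omp
srun -N 1 -n 8 -c 2 ./my_app   # placeholder for the real executable
```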
Setting affinity
In simple terms, setting affinity means selecting which CPU cores, GPU(s), and/or other resources a process and its threads will use for a given run. Affinity setup may differ between scenarios, so identifying the different cases and applying the right methods is key. In the sections below, a representative set of techniques is shown for different cases. Please note that each of these tools may offer more features that you can exploit for your use case. Since the features change often, it is always a good idea to refer to the man pages of the respective tools.
Serial, single-threaded applications
If your application runs only on a single CPU core, then a quick way to pin the process is to use the Linux numactl tool. Here is a simple example of indicating to the OS that the process may be scheduled on either core 1 or core 2.
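A sketch of such a command, with `./app` standing in for your serial executable:

```shell
# The OS may schedule the process on core 1 or core 2, but nowhere else.
numactl --physcpubind=1,2 ./app
```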
Tip: Since core 0 may be used by the OS to service interrupts and for other system activities, it is better to avoid it to reduce variability in your runs.
In the example below, we request that all the memory for this process be allocated on NUMA node 0 only.
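One way to express this with numactl (again with a placeholder executable name):

```shell
# Allocate all memory for this process from NUMA node 0 only.
numactl --membind=0 ./app
```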
Serial, multi-threaded applications
In this section, we consider applications that are multi-threaded using either Pthreads or OpenMP and some tools that are available for setting affinity for such applications.
Setting affinity for Pthreads based applications
numactl may be used for pinning a process and its threads by providing a range of cores to bind to. In the following example, we request running the executable on cores 1-7 and interleaving all memory allocations in NUMA nodes 0 and 1.
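A sketch of that command, with `./app` as a placeholder for the multi-threaded executable:

```shell
# Run the threads of ./app on cores 1-7, and interleave all memory
# allocations round-robin across NUMA nodes 0 and 1.
numactl --physcpubind=1-7 --interleave=0,1 ./app
```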
Setting affinity for OpenMP based applications
While numactl may be used on multi-threaded applications built with Pthreads or OpenMP, the OpenMP 5.2 standard specifies environment variables that can be combined and used for controlling affinity for OpenMP based multi-threaded applications. The descriptions for some key environment variables are given below.
- OMP_PLACES indicates the hardware resources for placement of the process and its threads. Values may be abstract names whose meaning is implementation dependent, such as cores, threads, sockets, L3CACHE or NUMA, or an explicit list of places described by non-negative numbers. Consider the following examples:
  - export OMP_PLACES=threads indicates that each place is a hardware thread (HWT)
  - export OMP_PLACES={0,1} to run the process and its threads on cores 0 and 1
  - export OMP_PLACES={0:OMP_NUM_THREADS:2} to run the process and its threads on cores 0, 2, 4, and so on, with OMP_NUM_THREADS places at a stride of 2
- OMP_PROC_BIND indicates how OpenMP threads are bound to resources. We can provide a comma-separated list of primary, close, spread, or false to indicate the policy at each nested level of parallelism.
  - export OMP_PROC_BIND=close to bind threads close to the primary thread on the given places
  - export OMP_PROC_BIND=spread to spread threads evenly over the given places
  - export OMP_PROC_BIND=primary to bind threads to the same place as the primary thread
  - export OMP_PROC_BIND=false to disable thread affinity
- The OpenMP specification contains more details about these environment variables and others.
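As a concrete sketch, the two variables are typically combined. The following assumes a hypothetical 8-thread run:

```shell
export OMP_NUM_THREADS=8
export OMP_PLACES=cores      # each place is one physical core
export OMP_PROC_BIND=spread  # distribute the 8 threads evenly over the places
# ./app                      # then launch your OpenMP executable (placeholder)
```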
If the GNU OpenMP implementation is used to build the application, we can also set affinity for the process and its threads using the environment variable GOMP_CPU_AFFINITY. In the example below, we run a process with 16 threads and bind them to cores 0, 4, 8, ..., 60.
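A sketch of that setup:

```shell
# GNU OpenMP runtime only: pin 16 threads to cores 0, 4, 8, ..., 60
# using GOMP_CPU_AFFINITY's start-end:stride interval syntax.
export OMP_NUM_THREADS=16
export GOMP_CPU_AFFINITY="0-60:4"
# ./app   # placeholder for an executable built with GCC's libgomp
```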
Multi-process (MPI), multi-threaded applications
Multi-process applications built with MPI support have several options for process placement, ordering, and binding, depending on the MPI implementation the application is built with and the scheduler used to submit the job.
OpenMPI’s mpirun command offers several affinity-related options such as --map-by and --report-bindings. The mpirun man page has extensive documentation for these options and other details. An example of running 4 MPI ranks, each with 2 OpenMP threads, using OpenMPI 5.0.2 is shown below. Here, we mix mpirun options and OpenMP environment variables to spread the two threads of each rank within a NUMA domain and to keep each rank in its own NUMA domain.
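One plausible spelling of that invocation is below; the exact map-by syntax varies between OpenMPI releases, so verify it against your mpirun man page.

```shell
# 4 ranks, one rank per NUMA domain (ppr:1:numa), each rank bound to
# 2 processing elements (pe=2); OpenMP spreads the 2 threads of each
# rank onto separate cores inside that domain.
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
mpirun -np 4 --map-by ppr:1:numa:pe=2 --report-bindings ./hello_mpi_omp
```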
The above example also shows running hello_mpi_omp to verify bindings programmatically. The sample output shown below was obtained on a node with topology similar to that shown in the Node topology – lstopo section of this article. Here, we see that each MPI rank got pinned to two HWTs from a unique NUMA domain.
Note that in some cases mpirun automatically binds processes to hardware resources, which can be quite confusing. We urge you to refer to the mpirun man page when your binding options do not give the desired outcome, and to look for default settings such as the ones described in the OpenMPI 5.0 manual.
The Slurm job scheduler offers a rich set of options to control binding of tasks to hardware resources. See the man page for srun or slurm.conf for documentation of all such options. It is important to note that Slurm configuration differs at each site, so we encourage checking the recommendations from your site’s system administrators.
MPICH does not have many affinity-related options, but if it is built with Slurm integration, Slurm bindings can be used to set affinity at runtime.
Hybrid applications that run on CPU + GPU
In hybrid applications where processes use CPU cores and GPU(s), we need to additionally control the affinity of each process to GPU devices. By default, all processes see all GPU devices. In Part 1 of this blog series, we saw the importance of mapping processes to GPU device(s) and CPU cores in the same NUMA domain. Now, we will look at different ways of achieving that. Our goal is to equip you with different techniques so that you can determine what works on your system.
Applications or libraries such as MPI may be built to use either the HIP runtime or the ROCr runtime, a lower-level runtime library in the ROCm stack. Depending on the runtime library used, one of two environment variables may be used to set affinity for GPU devices. For HIP applications, the environment variable HIP_VISIBLE_DEVICES can be used to restrict the GPU devices visible to the HIP runtime. In applications that use OpenMP target offload features to offload compute to the GPUs, the environment variable ROCR_VISIBLE_DEVICES can be used to restrict the GPU devices visible to the ROCr runtime. Since the HIP runtime depends on the ROCr runtime, note that only the subset of GPU devices made visible by ROCR_VISIBLE_DEVICES may be restricted further by HIP_VISIBLE_DEVICES.
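For example, to expose only the first two GPUs to an application (a sketch; `./app` is a placeholder):

```shell
# For HIP applications: the process enumerates only GPUs 0 and 1.
export HIP_VISIBLE_DEVICES=0,1
# For OpenMP target offload, use the equivalent ROCr-level variable instead:
# export ROCR_VISIBLE_DEVICES=0,1
# ./app
```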
In some HPC sites, convenient slurm bindings exist to make such affinity control effortless. The Frontier User Guide offers very good documentation for mapping processes and their threads to CPUs, GPUs, and NICs on Frontier using Slurm options.
In some cases, when the CPU binding is achieved with OpenMPI and OpenMP options as shown in the previous section, setting GPU affinity may be accomplished using a wrapper script that is run by mpirun just before the actual executable. To give you an idea, a simple wrapper script that uses local rank IDs to map 8 processes to the 8 GPU devices on the node is shown below:
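A minimal sketch of such a wrapper follows. The script name is hypothetical; OMPI_COMM_WORLD_LOCAL_RANK is the local rank ID that OpenMPI exports into each rank’s environment.

```shell
#!/bin/bash
# set_gpu_affinity.sh (hypothetical name): give each of the 8 local
# ranks its own GPU by exposing exactly one device to the HIP runtime,
# then replace this shell with the real program and its arguments.
export HIP_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"
```

mpirun then launches the wrapper in place of the application, passing the real executable as the wrapper’s arguments.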
Now, extending the example from the section above to run 8 MPI ranks on a node with 8 GPUs, each rank running 2 OpenMP threads, we can map both CPUs and GPUs using the command:
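Assuming a wrapper script like the one described above (called set_gpu_affinity.sh here, a hypothetical name), one plausible command is:

```shell
# 8 ranks, one per NUMA domain, 2 cores per rank; each rank's wrapper
# then exposes the GPU matching its local rank ID before exec'ing
# the verification program.
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
mpirun -np 8 --map-by ppr:1:numa:pe=2 --report-bindings \
       ./set_gpu_affinity.sh ./hello_jobstep
```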
Note that this time, we ran the hello_jobstep program to verify CPU and GPU bindings. The output of the above command shows that each rank uses a different GPU device:
A slightly extended script, which sets up both CPU and GPU affinity for more general cases such as running on multiple nodes, packing multiple ranks on each GPU device, striding the OpenMP threads, or striding the MPI ranks so as to spread processes evenly across the available sockets, can be found in the AMD lab notes repository. Since this script manages both CPU and GPU affinity settings, it is important to use the mpirun --bind-to none option instead of the OMP_PROC_BIND setting or the --map-by mpirun option. The script can easily be extended to work with other MPI environments such as MPICH or Slurm. Here is an example of running 16 MPI ranks with 8 threads per rank on a node with 8 GPUs, so that each GPU is oversubscribed with 2 MPI ranks:
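With a placeholder name standing in for that repository script, the run might look like:

```shell
# 16 ranks with 8 threads each on a node with 8 GPUs; the wrapper script
# manages all CPU and GPU placement, so mpirun must not bind anything.
export OMP_NUM_THREADS=8
mpirun -np 16 --bind-to none ./affinity_wrapper.sh ./hello_jobstep
```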
In the partial output shown below, we observe that MPI ranks 0 and 1 run on GPU 0, MPI ranks 2 and 3 run on GPU 1, and so on. Ranks 0 and 1 are packed closely in NUMA domain 0 (cores 0-15), with one physical core per thread. With an NPS4 configuration such as on this node, to get the full CPU memory bandwidth we need to spread processes and threads evenly across the two sockets.
Conclusion
Affinity is an important consideration for achieving better performance from HPC applications. Setting affinity involves understanding the topology of the system and knowing how to control mapping for the application and system at hand. In this blog, we point the reader to several tools and techniques for studying the system and controlling affinity, placement, and order accordingly. By no means is this article a comprehensive authority on all possible ways to control affinity, but rather a small sample of what is possible. We stress the importance of reading the man pages for all these tools to figure out what works on your system. We encourage you to participate in GitHub discussions if you have any questions or comments.
References
- Frontier User Guide, Oak Ridge Leadership Computing Facility (OLCF), Oak Ridge National Laboratory (ORNL)
- Parallel and High Performance Computing, Robert Robey and Yuliana Zamora, Manning Publications, May 2021
- Performance Analysis of CP2K Code for Ab Initio Molecular Dynamics on CPUs and GPUs, Dewi Yokelson, Nikolay V. Tkachenko, Robert Robey, Ying Wai Li, and Pavel A. Dub, Journal of Chemical Information and Modeling 2022, 62 (10), 2378-2386, DOI: 10.1021/acs.jcim.1c01538
- Code examples from ORNL: hello_mpi_omp, hello_jobstep
Disclaimers
The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board.
HPE is a registered trademark of Hewlett Packard Enterprise Company and/or its affiliates.
Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.