Getting up and running with the AMD GPU Operator and Device Metrics Exporter on Kubernetes is quick and easy. Below is a short guide on how to get started using the helm installation method on a standard Kubernetes install. Note that more detailed instructions on the different installation methods can be found on this site:
GPU Operator Kubernetes Helm Install
GPU Operator Red Hat OpenShift Install
The GPU Operator uses cert-manager to manage certificates for mTLS communication between services. If you haven’t already installed cert-manager as a prerequisite on your Kubernetes cluster, you’ll need to install it as follows:
Once cert-manager is installed, you’re just a few commands away from installing the GPU Operator and having a fully managed GPU infrastructure. Add the Helm repository and fetch the latest Helm charts:
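A minimal sketch of the cert-manager install using its Helm chart. The release name, the cert-manager namespace, and the pinned version v1.15.1 are assumptions chosen for illustration; check the cert-manager documentation for a release that matches your cluster.

```shell
# Add the Jetstack repository that hosts the cert-manager chart
helm repo add jetstack https://charts.jetstack.io --force-update

# Install cert-manager together with its CRDs
# (the version below is an assumption; pick one supported on your cluster)
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.15.1 \
  --set crds.enabled=true
```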
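As a sketch, adding the repository could look like the following. The local alias rocm is an arbitrary choice, and the URL is assumed to be the chart repository published for the ROCm/gpu-operator project; verify it against the Helm install guide.

```shell
# Register the GPU Operator chart repository under the alias "rocm" (alias is arbitrary)
helm repo add rocm https://rocm.github.io/gpu-operator

# Refresh the local chart index so the latest chart versions are visible
helm repo update
```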
Install the GPU Operator
Use the helm install command to install the AMD GPU Operator Helm charts.
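A typical invocation might look like the sketch below. It assumes the repository was added under the alias rocm, that the chart is named gpu-operator-charts (consistent with the controller pod name listed later on this page), and uses amd-gpu-operator as a hypothetical release name.

```shell
# Install the operator into the default namespace used throughout this guide.
# --version is optional; v1.3.0 is shown only as an example.
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --version=v1.3.0
```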
Tip
Before v1.3.0, the GPU Operator Helm chart does not provide a default DeviceConfig, so you need to take an extra step to create one.
Starting from v1.3.0, the helm install command supports one-step installation and configuration: it creates a default DeviceConfig with default values. These defaults may not suit every deployment scenario, so please refer to Typical Deployment Scenarios for more information and the corresponding helm install commands.
The --version flag is optional. If it is not specified, the latest version of the GPU Operator will be installed.
The namespace kube-amd-gpu is the default namespace for the GPU Operator. You can change it by using the --namespace flag.
Use VM worker node with VF-Passthrough GPU
If you are using a VM-based GPU worker node with Virtual Function (VF) Passthrough powered by the AMD MxGPU GIM driver, the VF device shows up in the guest VM.
You need to change the default node selector to "feature.node.kubernetes.io/amd-vgpu":"true" so that the DeviceConfig works for your VM-based cluster.
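One way to override the selector at install time is Helm's --set-json flag (available in Helm 3.10 and later). In the sketch below, the values path deviceConfig.spec.selector is an assumption about the chart's layout, as is the need to null out a default amd-gpu key; confirm both with helm show values rocm/gpu-operator-charts before relying on it.

```shell
# Replace the default selector with the vGPU label.
# ASSUMPTION: the chart exposes the DeviceConfig spec under deviceConfig.spec,
# and the default selector key is feature.node.kubernetes.io/amd-gpu.
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --set-json 'deviceConfig.spec.selector={"feature.node.kubernetes.io/amd-gpu":null,"feature.node.kubernetes.io/amd-vgpu":"true"}'
```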
Use GPU worker node without inbox / pre-installed driver
If your worker node doesn’t have an inbox / pre-installed AMD GPU driver loaded, the operands (e.g. device plugin, metrics exporter) will be stuck in the Init 0/1 pod state.
If you plan to use the GPU Operator to install an out-of-tree driver on your worker nodes, please refer to the Driver Installation Guide to configure the default DeviceConfig. Here are example commands:
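The following is a hedged sketch rather than the guide's exact commands. It assumes the chart exposes the DeviceConfig driver settings under deviceConfig.spec.driver, and the driver version and image repository are hypothetical placeholders; the Driver Installation Guide lists the authoritative options.

```shell
# Hypothetical placeholders: substitute values valid for your environment
DRIVER_VERSION="6.4"
DRIVER_IMAGE="registry.example.com/amdgpu-driver"

# ASSUMPTION: driver options live under deviceConfig.spec.driver in the chart values
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --set deviceConfig.spec.driver.enable=true \
  --set deviceConfig.spec.driver.version="${DRIVER_VERSION}" \
  --set deviceConfig.spec.driver.image="${DRIVER_IMAGE}"
```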
Deploy DeviceConfig separately without using the default one during helm charts installation
You can use the option --set crds.defaultCR.install=false to disable the deployment of the default DeviceConfig, then deploy it later in a separate step with your desired configuration.
After running helm install with the proper configuration in values.yaml, you should see the GPU Operator pods starting up in the namespace you specified above (kube-amd-gpu by default). Here is an example for a cluster with one control plane node and one GPU worker node:
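A two-step deployment could be sketched as follows. The release name, the file name my-deviceconfig.yaml, and the DeviceConfig name are hypothetical, and the manifest shows only a small subset of fields (inbox driver, default GPU node selector); treat it as a starting point, not a complete configuration.

```shell
# Step 1: install the operator without the default DeviceConfig
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu \
  --create-namespace \
  --set crds.defaultCR.install=false

# Step 2: write a DeviceConfig of your own (hypothetical minimal example)
cat > my-deviceconfig.yaml <<'EOF'
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: my-deviceconfig
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false            # use the inbox / pre-installed driver
  selector:
    feature.node.kubernetes.io/amd-gpu: "true"
EOF

# Step 3: apply it once the operator is running
kubectl apply -f my-deviceconfig.yaml
```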
Controller components: gpu-operator-charts-controller-manager, kmm-controller and kmm-webhook-server
Operands: default-device-plugin, default-node-labeller and default-metrics-exporter
Please refer to Troubleshooting if you run into any issues during installation and configuration.
For a full list of DeviceConfig configurable options refer to the Full Reference Config documentation. An example DeviceConfig is supplied in the ROCm/gpu-operator repository.
That’s it! The GPU Operator components should now all be running. You can verify this by checking the namespace where the gpu-operator components are installed (default: kube-amd-gpu):
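For example, assuming the default namespace was kept:

```shell
# List the operator and operand pods; adjust -n if you installed elsewhere
kubectl get pods -n kube-amd-gpu
```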
To create a pod that uses a GPU, specify the GPU resource in your pod specification:
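A minimal sketch of such a pod, written to gpu-pod.yaml with a heredoc. The pod name, container name, the rocm/pytorch image, and the sleep command that keeps the container alive are illustrative assumptions; the part that matters is the amd.com/gpu resource limit.

```shell
# Write a pod manifest that requests one AMD GPU
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: rocm/pytorch:latest        # assumption: any ROCm-enabled image works here
    command: ["sleep", "infinity"]    # keep the container running for inspection
    resources:
      limits:
        amd.com/gpu: 1                # request one GPU from the device plugin
EOF
```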
Save this YAML to a file (e.g., gpu-pod.yaml) and create the pod:
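Assuming the file name suggested above:

```shell
# Create the pod from the saved manifest
kubectl create -f gpu-pod.yaml
```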
To check the status of GPUs in your cluster:
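One possible way to do this is to print the GPU counts each node advertises. The column headers are arbitrary; the dot in the resource name amd.com/gpu has to be escaped inside the custom-columns expression.

```shell
# Show total and allocatable AMD GPUs per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,TOTAL_GPUS:.status.capacity.amd\.com/gpu,ALLOCATABLE_GPUS:.status.allocatable.amd\.com/gpu'
```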
To run amd-smi in a pod:
Create a YAML file named amd-smi.yaml:
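A sketch of what the file might contain, written here with a heredoc. The rocm/pytorch image is an assumption (any image that ships amd-smi should do), and the two amd-smi subcommands were picked to show the version information and the list of visible GPUs.

```shell
# Write a one-shot pod that runs amd-smi and exits
cat > amd-smi.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: amd-smi
spec:
  restartPolicy: Never
  containers:
  - name: amd-smi
    image: rocm/pytorch:latest        # assumption: image that includes amd-smi
    command: ["/bin/bash", "-c"]
    args: ["amd-smi version && amd-smi list"]
    resources:
      limits:
        amd.com/gpu: 1
EOF
```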
Create the pod:
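For instance:

```shell
# Create the amd-smi pod from the manifest
kubectl create -f amd-smi.yaml
```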
Check the logs and verify that the amd-smi output reflects the expected ROCm version and GPU presence:
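Assuming the pod is named amd-smi:

```shell
# Print the output of the amd-smi container once the pod has completed
kubectl logs amd-smi
```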
To run rocminfo in a pod:
Create a YAML file named rocminfo.yaml:
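A sketch of what the file might contain, again written with a heredoc. The rocm/pytorch image is an assumption; any image that ships rocminfo should work.

```shell
# Write a one-shot pod that runs rocminfo and exits
cat > rocminfo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rocminfo
spec:
  restartPolicy: Never
  containers:
  - name: rocminfo
    image: rocm/pytorch:latest        # assumption: image that includes rocminfo
    command: ["/bin/sh", "-c"]
    args: ["rocminfo"]
    resources:
      limits:
        amd.com/gpu: 1
EOF
```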
Create the pod:
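For instance:

```shell
# Create the rocminfo pod from the manifest
kubectl create -f rocminfo.yaml
```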
Check the logs and verify the output:
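Assuming the pod is named rocminfo:

```shell
# Print the rocminfo output; the GPU agents should appear in the listing
kubectl logs rocminfo
```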
Configuration parameters are documented in the Custom Resource Installation Guide.