← 返回首页
Manual/Scheduled Test — AMD GPU Operator 1.5.0
Back to top Ctrl+K

Manual/Scheduled Test#

Manual test trigger#

To start the manual test, directly use the test runner image to create the Kubernetes job and related resources, then the test will be triggered manually.

Warning

The RVS test recipes in the Test Runner are not compatible with partitioned GPUs. If you are using a partitioned GPU, please reset the GPU partition configuration and conduct the manual test on a non-partitioned GPU.

Use Case 1 - GPU is unhealthy on the node#

When any GPU on a specific worker node is unhealthy, you can manually trigger a test / benchmark run on that worker node to check more details on the unhealthy state by mounting the AMD device related files or folders (/dev/dri and /dev/kfd) into the test runner container.

The test job requires RBAC config to grant the test runner access to export events and add node labels to the cluster. Here is an example of configuring the RBAC and Job resources:

apiVersion: v1 kind: ServiceAccount metadata: name: test-run namespace: default --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: test-run-cluster-role rules: - apiGroups: - "" resources: - events verbs: - get - list - watch - create - update - apiGroups: - "" resources: - nodes verbs: - patch - apiGroups: - security.openshift.io resourceNames: - privileged resources: - securitycontextconstraints verbs: - use --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: test-run-rb roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: test-run-cluster-role subjects: - kind: ServiceAccount name: test-run namespace: default --- apiVersion: v1 kind: ConfigMap metadata: name: manual-config-map namespace: default data: config.json: | { "TestConfig": { "GPU_HEALTH_CHECK": { "TestLocationTrigger": { "global": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "gst_single", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } } } } } } --- apiVersion: batch/v1 kind: Job metadata: name: test-runner-manual-trigger namespace: default spec: template: spec: serviceAccountName: test-run nodeSelector: kubernetes.io/hostname: node1 # requesting to run test on node1 volumes: # mount driver related directory and device interface - name: kfd hostPath: path: /dev/kfd type: CharDevice - name: dri hostPath: path: /dev/dri type: Directory - name: config-volume # Config map volume configMap: name: manual-config-map - hostPath: # Specify to use this directory on the host as volume path: /var/log/amd-test-runner type: DirectoryOrCreate name: test-runner-volume containers: - name: amd-test-runner image: docker.io/rocm/test-runner:v1.5.0 imagePullPolicy: IfNotPresent securityContext: # setup security context for container to get access to device related interfaces privileged: true volumeMounts: - mountPath: /dev/dri name: dri - mountPath: /dev/kfd name: kfd - mountPath: /var/log/amd-test-runner # Specify to mount host path volume into specific directory name: test-runner-volume - mountPath: /etc/test-runner/ name: config-volume env: - name: TEST_TRIGGER value: "MANUAL" # Set the TEST_TRIGGER environment variable to MANUAL for manual test - name: POD_NAME # Use downward API to pass pod name to test runner container valueFrom: fieldRef: fieldPath: metadata.name - name: POD_NAMESPACE # Use downward API to pass pod namespace to test runner container valueFrom: fieldRef: fieldPath: metadata.namespace - name: POD_UID # Use downward API to pass pod UID to test runner container valueFrom: fieldRef: fieldPath: metadata.uid - name: NODE_NAME # Use downward API to pass host name to test runner container valueFrom: fieldRef: fieldPath: spec.nodeName restartPolicy: Never backoffLimit: 0 ttlSecondsAfterFinished: 120 # TTL for the job to be auto cleaned

Use Case 2 - GPUs are healthy on the node#

When all GPUs on a worker node are healthy, you can manually trigger a benchmark test by requesting GPU resources (amd.com/gpu) on that node, rather than mounting device files or folders. If other GPU workloads are running and resources are unavailable, the system will wait until enough resources are free before starting the test.

The test job requires RBAC config to grant the test runner access to export events and add node labels to the cluster. Here is an example of configuring the RBAC and Job resources:

apiVersion: v1 kind: ServiceAccount metadata: name: test-run namespace: default --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: test-run-cluster-role rules: - apiGroups: - "" resources: - events verbs: - get - list - watch - create - update - apiGroups: - "" resources: - nodes verbs: - patch - apiGroups: - security.openshift.io resourceNames: - privileged resources: - securitycontextconstraints verbs: - use --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: test-run-rb roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: test-run-cluster-role subjects: - kind: ServiceAccount name: test-run namespace: default --- apiVersion: v1 kind: ConfigMap metadata: name: manual-config-map namespace: default data: config.json: | { "TestConfig": { "GPU_HEALTH_CHECK": { "TestLocationTrigger": { "global": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "gst_single", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } } } } } } --- apiVersion: batch/v1 kind: Job metadata: name: test-runner-manual-trigger namespace: default spec: template: spec: serviceAccountName: test-run nodeSelector: kubernetes.io/hostname: node1 # requesting to run test on node1 volumes: - name: config-volume # Config map volume configMap: name: manual-config-map - hostPath: # Specify to use this directory on the host as volume path: /var/log/amd-test-runner type: DirectoryOrCreate name: test-runner-volume containers: - resources: limits: amd.com/gpu: 8 # requesting all GPUs on the node requests: amd.com/gpu: 8 # requesting all GPUs on the node name: amd-test-runner image: docker.io/rocm/test-runner:v1.5.0 imagePullPolicy: IfNotPresent securityContext: # setup security context for container to get access to device related interfaces privileged: true volumeMounts: - mountPath: /var/log/amd-test-runner # Specify to mount host path volume into specific directory name: test-runner-volume - mountPath: /etc/test-runner/ name: config-volume env: - name: TEST_TRIGGER value: "MANUAL" # Set the TEST_TRIGGER environment variable to MANUAL for manual test - name: POD_NAME # Use downward API to pass pod name to test runner container valueFrom: fieldRef: fieldPath: metadata.name - name: POD_NAMESPACE # Use downward API to pass pod namespace to test runner container valueFrom: fieldRef: fieldPath: metadata.namespace - name: POD_UID # Use downward API to pass pod UID to test runner container valueFrom: fieldRef: fieldPath: metadata.uid - name: NODE_NAME # Use downward API to pass host name to test runner container valueFrom: fieldRef: fieldPath: spec.nodeName restartPolicy: Never backoffLimit: 0 ttlSecondsAfterFinished: 120 # TTL for the job to be auto cleaned

When test is running:

$ kubectl get job NAME STATUS COMPLETIONS DURATION AGE test-runner-manual-trigger Running 0/1 31s 31s $ kubectl get pod NAME READY STATUS RESTARTS AGE test-runner-manual-trigger-fnvhn 1/1 Running 0 65s

When test is completed:

$ kubectl get job NAME STATUS COMPLETIONS DURATION AGE test-runner-manual-trigger Complete 1/1 6m10s 7m21s $ kubectl get pod NAME READY STATUS RESTARTS AGE test-runner-manual-trigger-fnvhn 0/1 Completed 0 7m19s

Scheduled Job#

Furthermore, test runner images can also be utilized in Kubernetes CronJob, which allows the test job to be scheduled in the cluster.

apiVersion: v1 kind: ServiceAccount metadata: name: test-run namespace: default --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: test-run-cluster-role rules: - apiGroups: - "" resources: - events verbs: - get - list - watch - create - update - apiGroups: - "" resources: - nodes verbs: - patch - apiGroups: - security.openshift.io resourceNames: - privileged resources: - securitycontextconstraints verbs: - use --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: test-run-rb roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: test-run-cluster-role subjects: - kind: ServiceAccount name: test-run namespace: default --- apiVersion: v1 kind: ConfigMap metadata: name: manual-config-map namespace: default data: config.json: | { "TestConfig": { "GPU_HEALTH_CHECK": { "TestLocationTrigger": { "global": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "gst_single", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } } } } } } --- apiVersion: batch/v1 kind: CronJob metadata: name: test-runner-manual-trigger-cron-job-midnight spec: # check specific schedule config at https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#writing-a-cronjob-spec schedule: "0 0 * * *" # This schedule runs the job daily at midnight jobTemplate: spec: template: spec: serviceAccountName: test-run nodeSelector: kubernetes.io/hostname: node1 # requesting to run test on node1 volumes: # mount driver related directory and device interface - name: kfd hostPath: path: /dev/kfd type: CharDevice - name: dri hostPath: path: /dev/dri type: Directory - name: config-volume configMap: name: manual-config-map - hostPath: # Specify to use this directory on the host as volume path: /var/log/amd-test-runner type: DirectoryOrCreate name: test-runner-volume containers: - name: init-test-runner image: docker.io/rocm/test-runner:v1.5.0 imagePullPolicy: IfNotPresent securityContext: # setup security context for container to get access to device related interfaces privileged: true volumeMounts: - mountPath: /dev/dri name: dri - mountPath: /dev/kfd name: kfd - mountPath: /var/log/amd-test-runner # Specify to mount host path volume into specific directory name: test-runner-volume - mountPath: /etc/test-runner/ name: config-volume env: - name: TEST_TRIGGER value: "MANUAL" - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: POD_NAMESPACE valueFrom: fieldRef: fieldPath: metadata.namespace - name: POD_UID valueFrom: fieldRef: fieldPath: metadata.uid - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName restartPolicy: Never backoffLimit: 0 ttlSecondsAfterFinished: 120

When the job gets scheduled, the CronJob resource will show active jobs and the job and pod resources will be created.

$ kubectl get cronjob NAME SCHEDULE TIMEZONE SUSPEND ACTIVE LAST SCHEDULE AGE test-runner-manual-trigger-cron-job-midnight 0 0 * * * <none> False 1 2s 86s $ kubectl get job NAME STATUS COMPLETIONS DURATION AGE test-runner-manual-trigger-cron-job-midnight-28936820 Running 0/1 6s 6s $ kubectl get pod NAME READY STATUS RESTARTS AGE test-runner-manual-trigger-cron-job-midnight-28936820-kkqnj 1/1 Running 0 16s

Check test running node labels#

When the test is ongoing the corresponding label will be added to the node resource: "testrunner.amd.com.gpu_health_check.gst_single": "running", the test running label will be removed once the test completed.

Check test result#

Once the test run finishes, the Job’s Status field captures the test run result. If the test run is successful, the status is marked as Complete. In case of failure, status is marked as Failed.

Check test result event#

The test runner generated event can be found from Job resource defined namespace. The event contains more granular details about the test run. It can be used to gather more information in case of failure.

$ kubectl get events -n kube-amd-gpu LAST SEEN TYPE REASON OBJECT MESSAGE 8m8s Normal TestFailed pod/test-runner-manual-trigger-c4hpw [{"number":1,"suitesResult":{"42924":{"gpustress-3000-dgemm-false":"success","gpustress-41000-fp32-false":"failure","gst-1215Tflops-4K4K8K-rand-fp8":"failure","gst-8096-150000-fp16":"success"}}}]

More detailed information about test result events can be found in this section.

Advanced Configuration - ConfigMap#

You can create a config map to customize the test trigger and recipe configs. For the example config map and explanation please check this section.

After creating the config map, you can specify the volume and volume mount to mount the config map into test runner container.

  • In the config map the file name must be named as config.json

  • Within the test runner container the mount path should be /etc/test-runner/

Here is an example of applying the customized config map

apiVersion: v1 kind: ServiceAccount metadata: name: test-run namespace: default --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: test-run-cluster-role rules: - apiGroups: - "" resources: - events verbs: - get - list - watch - create - update - apiGroups: - "" resources: - nodes verbs: - patch - apiGroups: - security.openshift.io resourceNames: - privileged resources: - securitycontextconstraints verbs: - use --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: test-run-rb roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: test-run-cluster-role subjects: - kind: ServiceAccount name: test-run namespace: default --- apiVersion: v1 kind: ConfigMap metadata: name: manual-config-map namespace: default data: config.json: | { "TestConfig": { "GPU_HEALTH_CHECK": { "TestLocationTrigger": { "global": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "gst_single", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } }, "node1": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "babel", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } } } } } } --- apiVersion: batch/v1 kind: Job metadata: name: test-runner-manual-trigger namespace: default spec: template: spec: serviceAccountName: test-run nodeSelector: kubernetes.io/hostname: node1 # requesting to run test on node1 volumes: # mount driver related directory and device interface - name: kfd hostPath: path: /dev/kfd type: CharDevice - name: dri hostPath: path: /dev/dri type: Directory - name: config-volume configMap: name: manual-config-map - hostPath: # Specify to use this directory on the host as volume path: /var/log/amd-test-runner type: DirectoryOrCreate name: test-runner-volume containers: - name: amd-test-runner image: docker.io/rocm/test-runner:v1.5.0 imagePullPolicy: IfNotPresent securityContext: # setup security context for container to get access to device related interfaces privileged: true volumeMounts: - mountPath: /dev/dri name: dri - mountPath: /dev/kfd name: kfd - mountPath: /var/log/amd-test-runner # Specify to mount host path volume into specific directory name: test-runner-volume - mountPath: /etc/test-runner/ name: config-volume env: - name: TEST_TRIGGER value: "MANUAL" # Set the TEST_TRIGGER environment variable to MANUAL for manual test - name: POD_NAME # Use downward API to pass pod name to test runner container valueFrom: fieldRef: fieldPath: metadata.name - name: POD_NAMESPACE # Use downward API to pass pod namespace to test runner container valueFrom: fieldRef: fieldPath: metadata.namespace - name: POD_UID # Use downward API to pass pod UID to test runner container valueFrom: fieldRef: fieldPath: metadata.uid - name: NODE_NAME # Use downward API to pass host name to test runner container valueFrom: fieldRef: fieldPath: spec.nodeName restartPolicy: Never backoffLimit: 0 ttlSecondsAfterFinished: 120 # TTL for the job to be auto cleaned

Advanced Configuration - Logs Mount#

The test runner compressed test logs are saved at /var/log/amd-test-runner/ folder within container. If you don’t configure the logs mount, by default the logs won’t be exported. There are many types of Kubernetes volumes to mount into the test runner container so that the test logs can be persisted at your desired place.

Here is an example of mounting the hostPath into the container, the key points are:

  • Define a hostPath volume

  • Mount the volume to a directory within test runner container

  • Use LOG_MOUNT_DIR environment variable to ask test runner to save logs into the mounted directory

apiVersion: v1 kind: ServiceAccount metadata: name: test-run namespace: default --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: test-run-cluster-role rules: - apiGroups: - "" resources: - events verbs: - get - list - watch - create - update - apiGroups: - "" resources: - nodes verbs: - patch - apiGroups: - security.openshift.io resourceNames: - privileged resources: - securitycontextconstraints verbs: - use --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: test-run-rb roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: test-run-cluster-role subjects: - kind: ServiceAccount name: test-run namespace: default --- apiVersion: v1 kind: ConfigMap metadata: name: manual-config-map namespace: default data: config.json: | { "TestConfig": { "GPU_HEALTH_CHECK": { "TestLocationTrigger": { "global": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "gst_single", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } }, "node1": { "TestParameters": { "MANUAL": { "TestCases": [ { "Recipe": "babel", "Iterations": 1, "StopOnFailure": true, "TimeoutSeconds": 1200, "Arguments": "--parallel" } ] } } } } } } } --- apiVersion: batch/v1 kind: Job metadata: name: test-runner-manual-trigger namespace: default spec: template: spec: serviceAccountName: test-run nodeSelector: kubernetes.io/hostname: node1 # requesting to run test on node1 volumes: # mount driver related directory and device interface - name: kfd hostPath: path: /dev/kfd type: CharDevice - name: dri hostPath: path: /dev/dri type: Directory - name: config-volume configMap: name: manual-config-map - hostPath: # Specify to use this directory on the host as volume path: /var/log/amd-test-runner type: DirectoryOrCreate name: test-runner-volume containers: - name: amd-test-runner image: docker.io/rocm/test-runner:v1.5.0 imagePullPolicy: IfNotPresent securityContext: # setup security context for container to get access to device related interfaces privileged: true volumeMounts: - mountPath: /dev/dri name: dri - mountPath: /dev/kfd name: kfd - mountPath: /var/log/amd-test-runner # Specify to mount host path volume into specific directory name: test-runner-volume - mountPath: /etc/test-runner/ name: config-volume env: - name: LOG_MOUNT_DIR # Use LOG_MOUNT_DIR environment variable to ask test runner to save logs in mounted directory value: /var/log/amd-test-runner - name: TEST_TRIGGER value: "MANUAL" # Set the TEST_TRIGGER environment variable to MANUAL for manual test - name: POD_NAME # Use downward API to pass pod name to test runner container valueFrom: fieldRef: fieldPath: metadata.name - name: POD_NAMESPACE # Use downward API to pass pod namespace to test runner container valueFrom: fieldRef: fieldPath: metadata.namespace - name: POD_UID # Use downward API to pass pod UID to test runner container valueFrom: fieldRef: fieldPath: metadata.uid - name: NODE_NAME # Use downward API to pass host name to test runner container valueFrom: fieldRef: fieldPath: spec.nodeName restartPolicy: Never backoffLimit: 0 ttlSecondsAfterFinished: 120 # TTL for the job to be auto cleaned

Cleanup Manual / Scheduled Test#

When you create the manual or scheduled test resources, it is recommended to put all of them into one YAML file. By running commands like kubectl apply -f xxx.yaml all the related resources will be created. When you want to remove those resources jus run commands kubectl delete -f xxx.yaml to remove those resources from the cluster.

Warning

For the Manual or Scheduled Test run, when you delete the resources that interrupts the test run, you need to double check the node labels to manually remove the test running label like "testrunner.amd.com.gpu_health_check.gst_single": "running".

© 2025 Advanced Micro Devices, Inc