In addition to RVS (ROCm Validation Suite), the test runner supports AGFHC (AMD GPU Field Health Check) to verify the health of AMD GPUs in production environments. The test runner image runs AGFHC in a containerized environment to simplify execution and deployment.
AGFHC provides a command-line interface for running GPU health checks and delivers consistent PASS/FAIL results, with detailed logs and a results.json file for failures. Tests can be grouped into recipes (e.g., short or extended runs) to cover common scenarios, and updates to AGFHC improve coverage without requiring user changes.
Note
The public test runner image supports only RVS tests.
The AGFHC toolkit is NOT publicly accessible and requires special authorization. It can be used not only with the test runner but also in various other workflows. For more details, see the Instinct documentation website.
To access the full test runner image, which includes both RVS and the AGFHC toolkit, please contact your AMD representative to complete the authorization process.
To support more than one test framework, the test runner allows you to specify the test framework in the config.json file.
Example ConfigMap using the AGFHC test framework:
If no framework is specified, RVS is used by default. To switch to AGFHC, set the Framework field to AGFHC in the TestCases section of config.json. The Recipe field selects the test suite to run from the chosen framework, and the optional Arguments field passes additional arguments to the test cases. At present, only one test case can be run at a time.
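As a rough illustration of the fields described above, the following Python sketch builds a single AGFHC entry for the TestCases section and prints it as JSON. Only the Framework, Recipe, and Arguments field names and the RVS default come from the description above. The helper names, the comma-separated form of Arguments, and the omission of the enclosing config.json structure are assumptions made for illustration, so check the test runner documentation for the exact schema.

```python
import json

SUPPORTED_FRAMEWORKS = {"RVS", "AGFHC"}


def make_test_case(recipe, framework="RVS", arguments=None):
    """Build one TestCases entry (hypothetical helper, not part of the test runner)."""
    if framework not in SUPPORTED_FRAMEWORKS:
        raise ValueError(f"unsupported framework: {framework}")
    case = {"Framework": framework, "Recipe": recipe}
    if arguments:
        # Assumption: Arguments is a single comma-separated string.
        case["Arguments"] = ",".join(arguments)
    return case


def make_test_cases_section(cases):
    """Wrap entries in a TestCases list, enforcing the one-test-case limit."""
    if len(cases) != 1:
        raise ValueError("only one test case can be run at a time")
    return {"TestCases": list(cases)}


if __name__ == "__main__":
    section = make_test_cases_section(
        [make_test_case("all_lvl1", framework="AGFHC", arguments=["--ignore-dmesg"])]
    )
    # Prints only the TestCases fragment, not a complete config.json.
    print(json.dumps(section, indent=2))
```

The fragment would still need to be placed inside the rest of config.json, whose layout is not shown here.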
Please refer to the AGFHC documentation for available test recipes and additional configuration options.
Here is the AGFHC test recipe support matrix and a brief introduction to each recipe:
| GPU | Recipes marked supported (out of 36) |
| --- | --- |
| MI300A | 21 |
| MI300X | 36 |
| MI300X-HF | 36 |
| MI308X | 36 |
| MI308X-HF | 36 |
| MI325X | 36 |
| MI350X | 29 |
| MI355X | 29 |

| Recipe | Description |
| --- | --- |
| all_burnin_12h | A ~12h check across system |
| all_burnin_24h | A ~24h check across system |
| all_burnin_4h | A ~4h check across system |
| all_lvl1 | A ~5m check across system |
| all_lvl2 | A ~10m check across system |
| all_lvl3 | A ~30m check across system |
| all_lvl4 | A ~1h check across system |
| all_lvl5 | A ~2h check across system |
| all_perf | Run all performance based tests |
| dma_lvl1 | A ~5m DMA workload |
| dma_lvl2 | A ~10m DMA workload |
| dma_lvl3 | A ~30m DMA workload |
| dma_lvl4 | A ~1h DMA workload |
| gfx_lvl1 | A ~5m GFX workload |
| gfx_lvl2 | A ~10m GFX workload |
| gfx_lvl3 | A ~30m GFX workload |
| gfx_lvl4 | A ~1h GFX workload |
| hbm_burnin_24h | A ~24h extended hbm test |
| hbm_burnin_8h | A ~8h extended hbm test |
| hbm_lvl1 | A ~5m HBM workload |
| hbm_lvl2 | A ~10m HBM workload |
| hbm_lvl3 | A ~30m HBM workload |
| hbm_lvl4 | A ~1h HBM workload |
| hbm_lvl5 | A ~2h HBM workload |
| hsio | Run all HSIO tests once |
| pcie_lvl1 | A ~5m PCIe workload |
| pcie_lvl2 | A ~10m PCIe workload |
| pcie_lvl3 | A ~30m PCIe workload |
| pcie_lvl4 | A ~1h PCIe workload |
| rochpl_isolation | Run rocHPL on each GPU |
| single_pass | Run all tests once |
| thermal | Verify thermal solution |
| xgmi_lvl1 | A ~5m xGMI workload |
| xgmi_lvl2 | A ~10m xGMI workload |
| xgmi_lvl3 | A ~30m xGMI workload |
| xgmi_lvl4 | A ~1h xGMI workload |

NOTE: Each recipe listed above may consist of multiple test cases. Running an individual AGFHC test case is currently not supported.
Instinct GPU models can be configured with certain GPU partition profiles to execute AGFHC tests. The supported partition profiles are:
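The leveled recipes follow a regular naming pattern (`<component>_lvl<N>`), with approximate durations of 5m, 10m, 30m, 1h, and 2h for levels 1 to 5. A minimal sketch of how a script might pick the longest leveled recipe that fits a time budget is shown below. The function name `pick_recipe` is hypothetical, the durations are the approximate values from the recipe descriptions, and the sketch does not check whether a given GPU model supports the chosen recipe.

```python
# Approximate durations in minutes, taken from the recipe descriptions.
LEVEL_MINUTES = {1: 5, 2: 10, 3: 30, 4: 60, 5: 120}

# Highest level listed for each component in the recipe list.
MAX_LEVEL = {"all": 5, "dma": 4, "gfx": 4, "hbm": 5, "pcie": 4, "xgmi": 4}


def pick_recipe(component, budget_minutes):
    """Return the longest <component>_lvlN recipe that fits the budget, or None."""
    if component not in MAX_LEVEL:
        raise ValueError(f"no leveled recipes for component: {component}")
    fitting = [
        level
        for level, minutes in LEVEL_MINUTES.items()
        if level <= MAX_LEVEL[component] and minutes <= budget_minutes
    ]
    if not fitting:
        return None
    return f"{component}_lvl{max(fitting)}"


if __name__ == "__main__":
    print(pick_recipe("hbm", 45))
    print(pick_recipe("dma", 500))
```

Because the listed durations are approximate, a real scheduler would probably leave some headroom on top of the budget.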
| GPU model | Compute partition | Memory partition | Number of GPUs |
| --- | --- | --- | --- |
| mi300a | SPX | NPS1 | 1 |
| mi300a | SPX | NPS1 | 2 |
| mi300a | SPX | NPS1 | 4 |
| mi300x | SPX | NPS1 | 1 |
| mi300x | SPX | NPS1 | 8 |
| mi308x | SPX | NPS1 | 1 |
| mi308x | SPX | NPS1 | 8 |
| mi325x | SPX | NPS1 | 1 |
| mi325x | SPX | NPS1 | 8 |
| mi308x-hf | SPX | NPS1 | 1 |
| mi308x-hf | SPX | NPS1 | 8 |
| mi300x-hf | SPX | NPS1 | 1 |
| mi300x-hf | SPX | NPS1 | 8 |
| mi350x | SPX | NPS1 | 1 |
| mi350x | SPX | NPS1 | 8 |
| mi355x | SPX | NPS1 | 1 |
| mi355x | SPX | NPS1 | 8 |

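A pre-flight check against the profiles listed above could look like the following sketch. It assumes that the last value in each row is the number of GPUs in the system, and the function name `is_supported_profile` is hypothetical. The data is copied from the rows above and would need updating whenever the supported profiles change.

```python
# GPU model -> GPU counts listed for the SPX / NPS1 profile.
# Assumption: the last value in each listed row is the number of GPUs.
SUPPORTED_GPU_COUNTS = {
    "mi300a": {1, 2, 4},
    "mi300x": {1, 8},
    "mi308x": {1, 8},
    "mi325x": {1, 8},
    "mi308x-hf": {1, 8},
    "mi300x-hf": {1, 8},
    "mi350x": {1, 8},
    "mi355x": {1, 8},
}


def is_supported_profile(model, gpu_count, compute="SPX", memory="NPS1"):
    """Return True if the combination appears in the list of supported profiles."""
    # Only the SPX compute partition with the NPS1 memory partition is listed.
    if compute.upper() != "SPX" or memory.upper() != "NPS1":
        return False
    return gpu_count in SUPPORTED_GPU_COUNTS.get(model.lower(), set())


if __name__ == "__main__":
    print(is_supported_profile("MI300X", 8))
    print(is_supported_profile("mi300x", 8, compute="CPX"))
```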
For the full list of AGFHC arguments, please refer to the official AGFHC documentation. The following arguments are frequently used:
| Argument | Description | Default / Example |
| --- | --- | --- |
| --update-interval UPDATE_INTERVAL | Set the interval to print elapsed timing updates on the console. | --update-interval 20s - updates every 20s |
| --sysmon-interval SYSMON_INTERVAL | Set to update the default sysmon interval | |
| --tar-logs | Generate a tar file of all logs | |
| --disable-sysmon | Set to disable system monitoring data collection. | Default: enabled |
| --disable-numa-control | Set to disable control of numa balancing. | Default: enabled |
| --disable-ras-checks | Set to disable ras checks. | Default: enabled |
| --disable-bad-pages-checks | Set to disable bad pages checks. | Default: enabled |
| --disable-dmesg-checks | Set to disable dmesg checks. | Default: enabled |
| --ignore-dmesg | Set to ignore dmesg fails, logs will still be created. | Default: dmesg fails enabled |
| --ignore-ras | Set to ignore ras fails, logs will still be created. | Default: ras fails enabled |
| --ignore-performance | Set to ignore performance to skip the performance analysis and perform only RAS/dmesg checks. | Default: performance analysis enabled |
| --known-dmesg-only | Do not fail on any unknown dmesg, but mark them as expected. | Default: any unknown dmesg fails |
| --disable-hsio-gather | Set to disable hsio gather. | Default: enabled |
| --exit-on-failure, -e | Exits the execution of test cases on failure of a test, marking remaining as skipped. | Default: keep running without exiting on failure |

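To catch typos before a long recipe starts, a script could compare the intended arguments against the frequently used ones listed above. The sketch below is illustrative only: `unknown_flags` is a hypothetical helper, and because the list above is not exhaustive, a flag it reports may still be valid in AGFHC itself.

```python
# Flags from the list of frequently used arguments.
KNOWN_FLAGS = {
    "--update-interval",
    "--sysmon-interval",
    "--tar-logs",
    "--disable-sysmon",
    "--disable-numa-control",
    "--disable-ras-checks",
    "--disable-bad-pages-checks",
    "--disable-dmesg-checks",
    "--ignore-dmesg",
    "--ignore-ras",
    "--ignore-performance",
    "--known-dmesg-only",
    "--disable-hsio-gather",
    "--exit-on-failure",
    "-e",
}

# Flags that are followed by a value (for example "--update-interval 20s").
VALUE_FLAGS = {"--update-interval", "--sysmon-interval"}


def unknown_flags(tokens):
    """Return the flags in tokens that are not in the frequently used list."""
    unknown = []
    skip_next = False
    for token in tokens:
        if skip_next:
            # This token is the value of the preceding flag.
            skip_next = False
            continue
        if token in VALUE_FLAGS:
            skip_next = True
        elif token.startswith("-") and token not in KNOWN_FLAGS:
            unknown.append(token)
    return unknown


if __name__ == "__main__":
    print(unknown_flags(["--update-interval", "20s", "--ignore-dmsg"]))
```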
Upon successful execution of the AGFHC test recipe, the results are output as Kubernetes events. You can view these events using the following command:
If the test fails, the event will indicate a failure status.
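As one possible way to retrieve these events from a script, the sketch below wraps the generic `kubectl get events` command in Python. The namespace value is a placeholder that depends on where the test runner is deployed, the function names are hypothetical, and the sketch does not parse the events because their exact format is not described here.

```python
import subprocess


def events_command(namespace):
    """Build the kubectl argument list for listing events in a namespace."""
    return ["kubectl", "get", "events", "--namespace", namespace]


def fetch_events(namespace):
    """Run kubectl and return its raw text output (requires cluster access)."""
    result = subprocess.run(
        events_command(namespace),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if kubectl exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # "my-namespace" is a placeholder, not a value from the documentation.
    print(fetch_events("my-namespace"))
```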
By default, test execution logs are saved to /var/log/amd-test-runner/ on the host. Log export is also supported, as it is for RVS. AGFHC provides more detailed logs than RVS, and all logs produced by the framework are included in the tarball.