
ONNX and DirectML execution provider guide - part 1

Originally posted: October 29, 2025
Hui Zhang

Introduction

AMD hardware can run neural networks extremely fast with the ONNX Runtime - DirectML execution provider (EP), but getting that performance depends on using the EP correctly.

In this article, I will discuss a common way to get the most out of the ONNX Runtime - DirectML EP by avoiding unnecessary data transfers between the CPU and the GPU.

1. Problem description

First, let's look at a basic scenario when using the ONNX Runtime - DirectML EP.

  1. Load the input data from memory or disk.
  2. Preprocess the data on the CPU, for example converting the layout from NHWC to NCHW.
  3. Upload the preprocessed data to the GPU.
  4. Run the inference on the GPU.
  5. Load the result back to the CPU.
  6. Postprocess the data on the CPU and prepare the next inference.
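The CPU-side layout conversion mentioned in step 2 can be sketched as follows. This is a minimal, illustrative implementation for float data; the function name is mine, and a real pipeline would typically fold this conversion into a GPU compute pass instead, as discussed below.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an NHWC float tensor to NCHW layout on the CPU (step 2 above).
// Illustrative only; in a native pipeline this work moves to a compute shader.
std::vector<float> NhwcToNchw(const std::vector<float>& src,
                              std::size_t n, std::size_t h,
                              std::size_t w, std::size_t c)
{
    std::vector<float> dst(src.size());
    for (std::size_t in = 0; in < n; ++in)
        for (std::size_t ih = 0; ih < h; ++ih)
            for (std::size_t iw = 0; iw < w; ++iw)
                for (std::size_t ic = 0; ic < c; ++ic)
                {
                    // Source index: channels interleaved per pixel (NHWC).
                    const std::size_t srcIdx = ((in * h + ih) * w + iw) * c + ic;
                    // Destination index: one plane per channel (NCHW).
                    const std::size_t dstIdx = ((in * c + ic) * h + ih) * w + iw;
                    dst[dstIdx] = src[srcIdx];
                }
    return dst;
}
```
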

In this scenario, there are many data transfers between the CPU and the GPU, which adds unnecessary latency. The GPU is typically faster at pre- and post-processing than the CPU, and the input data might already reside on the GPU, for example in a gaming scenario.

A much better approach is to build a native inference pipeline that executes the entire procedure on the GPU.

But how do you put this into practice with the ONNX Runtime - DirectML EP? For that, you need to dive into DirectX® 12, which DirectML is built on.

2. Setting up ONNX Runtime with DirectX 12

Usually, DirectML creates a DirectX 12 device and compute queue by itself. But for a native pipeline, the application also needs a DirectX 12 device and queue to pre- and post-process the data, so the same DirectX 12 context should be shared between the application and DirectML.

To combine the ONNX Runtime - DirectML EP with DirectX 12, there must be a way to create the DirectML context from an existing DirectX 12 device and queue, and the DirectML EP API provides exactly that.

// The DX12 device must be created beforehand (e.g. via D3D12CreateDevice).
Microsoft::WRL::ComPtr<ID3D12Device> dx12Device;

// Query the DirectML EP API from ONNX Runtime.
const OrtDmlApi* ortDmlApi = nullptr;
Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi("DML", ORT_API_VERSION, reinterpret_cast<const void**>(&ortDmlApi)));

Ort::SessionOptions sessionOptions;
// The DirectML EP requires memory pattern optimization to be disabled
// and sequential execution.
sessionOptions.DisableMemPattern();
sessionOptions.SetExecutionMode(ORT_SEQUENTIAL);

// Create the DML device via the existing DX12 device.
Microsoft::WRL::ComPtr<IDMLDevice> dmlDevice;
HRESULT hr = DMLCreateDevice1(dx12Device.Get(), DML_CREATE_DEVICE_FLAG_NONE, DML_FEATURE_LEVEL_2_0, IID_PPV_ARGS(&dmlDevice));
if (SUCCEEDED(hr))
{
    // Create a compute queue on the same device and hand both to the EP.
    Microsoft::WRL::ComPtr<ID3D12CommandQueue> commandQueue;
    D3D12_COMMAND_QUEUE_DESC desc{};
    desc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    if (SUCCEEDED(hr = dx12Device->CreateCommandQueue(&desc, IID_PPV_ARGS(&commandQueue))))
    {
        // Enable the DML device in the current session.
        Ort::ThrowOnError(ortDmlApi->SessionOptionsAppendExecutionProvider_DML1(sessionOptions, dmlDevice.Get(), commandQueue.Get()));
    }
}

Now you can create the ONNX Runtime session in the usual way with this Ort::SessionOptions object.

3. Mapping DirectX 12 resource into ONNX Runtime

Now that the ONNX Runtime - DirectML EP + DirectX 12 context is created, let's talk about how to build the input and output tensors.

As the ONNX Runtime - DirectML EP is based on DirectX 12, there must be a way to wrap an existing DirectX 12 resource in a DirectML tensor value, and the DirectML EP API provides this as well.

ID3D12Resource* resource; // must be initialized beforehand
std::vector<int64_t> shape = { 1920, 1080 }; // dummy shape
ONNXTensorElementDataType type = ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;

// Wrap the DX12 resource in a DML GPU allocation.
void* dmlAllocatorResource;
Ort::ThrowOnError(ortDmlApi->CreateGPUAllocationFromD3DResource(resource, &dmlAllocatorResource));

// Free the DML allocation automatically once it is no longer needed.
std::unique_ptr<void, std::function<void(void*)>> dmlAllocatorResourceCleanup(
    dmlAllocatorResource,
    [ortDmlApi](void* ptr) { ortDmlApi->FreeGPUAllocation(ptr); });

// Create the OrtValue as a tensor, letting ONNX Runtime know that we own the data buffer.
Ort::MemoryInfo memoryInfo("DML", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
OrtValue* value;
Ort::ThrowOnError(Ort::GetApi().CreateTensorWithDataAsOrtValue(
    memoryInfo,
    dmlAllocatorResource,
    static_cast<size_t>(resource->GetDesc().Width), // buffer size in bytes
    shape.data(),
    shape.size(),
    type,
    &value));
return { Ort::Value(value), std::move(dmlAllocatorResourceCleanup) };

Now the existing DirectX 12 resource is mapped to an Ort::Value object. You can run the inference via Ort::Session::Run(...) with the input and output Ort::Value containing your DirectX 12 resource.
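One easy mistake in this mapping is a size mismatch: CreateTensorWithDataAsOrtValue takes the buffer length in bytes, which must cover the tensor described by the shape and element type. A small sanity check against resource->GetDesc().Width can catch this early. The helper below is mine, not part of any API, and is just one way to compute the minimum required size:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the minimum buffer size in bytes for a tensor with the given shape
// and element size, to compare against the DX12 resource width before mapping.
std::size_t RequiredTensorBytes(const std::vector<int64_t>& shape, std::size_t elementSize)
{
    std::size_t count = 1;
    for (int64_t dim : shape)
        count *= static_cast<std::size_t>(dim); // total element count
    return count * elementSize;
}
```

Asserting `RequiredTensorBytes(shape, sizeof(float)) <= resource->GetDesc().Width` before creating the OrtValue turns a subtle GPU buffer overrun into an obvious CPU-side failure.
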

The reverse direction is also possible: you can retrieve the underlying DirectX 12 resource from an Ort::Value. This isn't necessary for resources you created yourself, but it is useful when the Ort::Value was allocated by the ONNX Runtime - DirectML EP itself.

// "allocator" is the session's DML device allocator.
Microsoft::WRL::ComPtr<ID3D12Resource> inputResource;
Ort::ThrowOnError(ortDmlApi->GetD3D12ResourceFromAllocation(allocator, inputTensor.GetTensorMutableData<void*>(), &inputResource));

4. Faster preprocessing and postprocessing

As the DirectX 12 resources are now mapped to Ort::Value objects, you can turn to making preprocessing and postprocessing faster.

This can be achieved with compute shaders dispatched on the existing DirectX 12 device and its compute queue.

As DirectML doesn't expose the DirectX 12 command list it submits to the command queue, it's hard to use pure GPU-side synchronization between preprocessing and inference.

Also, Ort::Session::Run(...) uses CPU synchronization to abstract over the differences between execution providers, so it's equally hard to use GPU-side synchronization between inference and postprocessing.

These limitations of the current DirectML API can cause extra CPU workload, and they may be addressed in future API updates from Microsoft.
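In practice, this means the CPU orders the three phases around the blocking Run call, waiting on a fence between GPU work and inference. A rough pseudocode sketch of one iteration (not a complete implementation):

```
// Pseudocode: CPU-side ordering of one inference iteration.
record preprocess compute commands into the app's command list
execute the command list on the shared compute queue
signal a fence; CPU waits until preprocessing has finished on the GPU

session.Run(inputs, outputs)   // blocks on the CPU internally

record postprocess compute commands
execute them on the shared compute queue
signal a fence; CPU waits before reading back or presenting
```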

5. Conclusion

Although I’ve covered a lot of theoretical knowledge in this topic, practice makes perfect, so I encourage you to spend more time experimenting with some of the example code available online.

Also, the official DirectML NPU sample demonstrates the conversion between Ort::Value and DirectX 12 resources very well, and although it targets an NPU, it can easily be ported to a GPU.

For more information about the execution process, please check out the code comments.

Hui Zhang

Zhang Hui is a Member of Technical Staff in the AMD Devtech team, where he focuses on helping developers utilize AMD CPU cores efficiently and on building deep learning solutions for AMD AI products.
