Manual model conversion on GPU

This article describes the manual workflow for converting LLMs with a local NVIDIA GPU. It covers the required environment setup, the conversion steps, and how to run inference on a Windows Copilot+ PC with a Qualcomm NPU.

Converting LLMs requires an NVIDIA GPU. If you want Model Lab to manage your local GPU for you, follow the steps in Convert Model. Otherwise, follow the steps in this article.

Manually run model conversion on GPU

This workflow is configured through the qnn_config.json file and requires two separate Python environments; a sketch of creating them follows the list below.

  • The first environment is used for model conversion with GPU acceleration and includes packages such as onnxruntime-gpu and AutoGPTQ.
  • The second environment is used for QNN optimization and includes packages such as onnxruntime-qnn with specific dependencies.
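Any environment manager works for this. As a minimal sketch, the standard library's venv module is enough; the environment names below are illustrative assumptions, and on Windows you may prefer conda or the py launcher to pin a Python 3.10 x64 interpreter:

```python
# Minimal sketch: create the two isolated environments with the standard
# library's venv module. The names are illustrative assumptions, not
# anything the workflow requires.
import venv

venv.create("env-gpu-conversion", with_pip=True)    # first: GPU model conversion
venv.create("env-qnn-optimization", with_pip=True)  # second: QNN optimization
```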

First environment setup

In a Python 3.10 x64 environment with Olive installed, install the required packages:

=1.21.0" "onnxruntime-genai-cuda>=0.6.0" # AutoGPTQ: Install from source (stable package may be slow for weight packing) # Disable CUDA extension build (not required) # Linux export BUILD_CUDA_EXT=0 # Windows # set BUILD_CUDA_EXT=0 # Install AutoGPTQ from source pip install --no-build-isolation git+https://github.com/PanQiWei/AutoGPTQ.git # Please update CUDA version if needed pip install torch --index-url https://download.pytorch.org/whl/cu121

⚠️ Only set up the environment and install the packages. Do not run the olive run command at this point.

Second environment setup

In a Python 3.10 x64 environment with Olive installed, install the required packages for QNN optimization (including onnxruntime-qnn, as noted above). With both environments prepared, run the conversion:

```bash
olive run --config qnn_config.json
```

After this command completes, the optimized model is saved to ./model/model_name.
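If you prefer to drive the workflow from Python rather than the CLI, Olive also exposes a programmatic entry point; a minimal sketch, assuming qnn_config.json sits in the current working directory:

```python
# Minimal sketch: run the same Olive workflow programmatically.
# Assumes qnn_config.json sits in the current working directory.
from olive.workflows import run as olive_run

olive_run("qnn_config.json")
```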

⚠️ If optimization fails with an out-of-memory error, remove calibration_providers from the config file.
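If you would rather script that edit than make it by hand, a small hedged helper follows; it assumes calibration_providers appears inside entries of the config's passes section, so adjust the lookup if your qnn_config.json is structured differently:

```python
import json

# Remove any calibration_providers fields from qnn_config.json.
# Assumption: the field lives inside pass configurations under "passes".
with open("qnn_config.json") as f:
    config = json.load(f)

for pass_config in config.get("passes", {}).values():
    pass_config.pop("calibration_providers", None)

with open("qnn_config.json", "w") as f:
    json.dump(config, f, indent=2)
```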

⚠️ If optimization fails during context binary generation, rerun the command. The process will resume from the last completed step.

Manually run the inference sample

The optimized model can run inference with the ONNX Runtime QNN Execution Provider and ONNX Runtime GenAI. Inference must be run on a Windows Copilot+ PC with a Qualcomm NPU.

Install required packages in an arm64 Python environment

Model compilation using the QNN Execution Provider requires a Python environment with onnxruntime-qnn installed. In a separate arm64 Python environment with Olive installed, install the required packages:

=0.7.0rc2"

Run inference sample

Execute the provided inference_sample.ipynb notebook. Select an ipykernel that points to this arm64 Python environment.
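For reference, a minimal hedged sketch of what such a run can look like with the onnxruntime-genai Python API; the prompt and model folder are assumptions, the notebook may structure things differently, and chat-style models usually need a model-specific prompt template:

```python
import onnxruntime_genai as og

# Load the optimized model produced by the conversion step.
model = og.Model("./model/model_name")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is the golden ratio?"))

# Generate token by token, then decode the full sequence.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```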

⚠️ If you get a 6033 error, replace the genai_config.json file in the ./model/model_name folder.