LLM model quantization (compression) toolkit with hw acceleration support for NVIDIA CUDA, AMD ROCm, Huawei Ascend NPU, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.
12/31/2025 5.7.0-dev: ✨New FailSafe config and FailSafeStrategy, auto-enabled by default, to address uneven routing of MoE experts resulting in quantization issues for some MoE modules. Smooth operations are introduced to FailSafeStrategy to reduce the impact of outliers in FailSafe quantization using RTN by default. Different FailSafeStrategy and Smoothers can be selected. Threshold to activate FailSafe can also be customized. New Voxtral and Glm-4v model support, plus audio dataset calibration for Qwen2-Omni. AWQ compatibility fix for GLM 4.5-Air.
12/17/2025 5.6.2-12 Patch: Fixed uv compatibility. Both uv and pip installs will now show UI progress for external wheel/dependency downloads. Fixed macOS and AWQMarlin kernel loading import regressions. Resolved most multi-arch compile issues on Ubuntu, Arch, RedHat and other distros. Fixed multi-arch build issues and Tritonv2 kernel launch bug on multi-GPUs. Fixed 3-bit Triton GPTQ kernel dequant/inference and license property compatibility issue with latest pip/setuptools.
12/9/2025 5.6.0: ✨New HF Kernel for CPU optimized for AMX, AVX2 and AVX512. Auto module tree for auto-model support. Added AfMoE and Dots1 model support. Fixed pre-layer pass quantization speed regression. Improved HF Transformers, PEFT and Optimum support for both GPTQ and AWQ. Fixed many AWQ compatibility bugs and regressions.
11/9/2025 5.4.0: ✨New Intel CPU and XPU hardware-optimized AWQ TorchFusedAWQ kernel. Torch Fused kernels now compatible with torch.compile. Fixed AWQ MoE model compatibility and reduced VRAM usage.
11/3/2025 5.2.0: ✨MiniMax M2 support with ModelCloud BF16 M2 Model. New VramStrategy.Balanced quantization property for reduced memory usage for large MoE on multi-3090 (24GB) devices. ✨Marin model. New AWQ Torch reference kernel. Fixed AWQ Marlin kernel for bf16. Fixed GLM 4.5/4.6 MoE missing mtp layers on model save (HF bug). Modular refactor. 🎉AWQ support out of beta with full feature support including multi-GPU quant and MoE VRAM saving. ✨Brumby (attention free) model support. ✨IBM Granite Nano support. New calibration_concat_separator config option.
10/24/2025 5.0.0: 🎉 Data-parallel quant support for MoE models on multi-GPU using nogil Python. offload_to_disk support enabled by default to massively reduce CPU RAM usage. New Intel and AMD CPU hardware-accelerated TorchFused kernel. Packing stage is now 4x faster and now inlined with quantization. VRAM pressure for large models reduced during quantization. act_group_aware is 16k+ times faster and now the default when desc_act=False for higher quality recovery without inference penalty of desc_act=True. New beta quality AWQ support with full gemm, gemm_fast, marlin kernel support. LFM, Ling, Qwen3 Omni model support. Bitblas kernel updated to support Bitblas 0.1.0.post1 release. Quantization is now faster with reduced VRAM usage. Enhanced logging support with LogBar.
09/16/2025 4.2.5: hyb_act renamed to act_group_aware. Removed finicky torch import within setup.py. Packing bug fix and prebuilt PyTorch 2.8 wheels.
09/12/2025 4.2.0: ✨ New Models Support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New fail_safe boolean toggle to .quantize() to patch-fix non-activated MoE modules due to highly uneven MoE model training. Fixed LavaQwen2 compatibility. Patch-fixed GIL=0 CUDA error for multi-GPU. Fixed compatibility with autoround + new transformers.
09/04/2025 4.1.0: ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental mock_quantization config to skip complex computational code paths during quantization to accelerate model quant testing.
08/21/2025 4.0.0: 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x CPU core scaling of packing stage. Early access PyTorch 2.8 fused-ops on Intel XPU for up to 50% speedup.
10/17/2025 5.0.0-dev main: 👀: EoRA now multi-GPU compatible. Fixed both quality stability in multi-GPU quantization and VRAM usage. New LFM and Ling models support.
09/30/2025 5.0.0-dev main: 👀: New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) equals 80%+ overall quant time reduction of large MoE models vs v4.2.5.
09/29/2025 5.0.0-dev main: 🎉 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes.
09/24/2025 5.0.0-dev main: 🎉 Up to 90% CPU memory saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. act_group_aware now faster and auto enabled for GPTQ when desc_act is False for higher quality recovery.
09/19/2025 5.0.0-dev main: 👀 CPU memory saving of ~73.5% during quantization stage with new offload_to_disk quantization config property defaults to True.
09/18/2025 5.0.0-dev main: 🎉 AWQ quantization support! Complete refactor and simplification of model definitions in preparation for future quantization formats.
08/19/2025 4.0.0-dev main: Fixed quantization memory usage due to some models' incorrect application of config.use_cache during inference. Fixed Transformers >= 4.54.0 compatibility which changed layer forward return signature for some models.
08/18/2025 4.0.0-dev main: GPT-Neo model support. Memory leak fix in error capture (stack trace) and fixed lm_head quantization compatibility for many models.
07/31/2025 4.0.0-dev main: New Group Aware Reordering (GAR) support and preliminary PyTorch 2.8 fused-ops for Intel XPU for up to 50% speedup.
07/03/2025 4.0.0-dev main: New Baidu Ernie and Huawei PanGu model support.
07/02/2025 4.0.0-dev main: Gemma3 4B model compatibility fix.
05/29/2025 4.0.0-dev main: Falcon H1 model support. Fixed Transformers 4.52+ compatibility with Qwen 2.5 VL models.
05/19/2025 4.0.0-dev main: Qwen 2.5 Omni model support.
05/05/2025 4.0.0-dev main: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x CPU core scaling of packing stage.
04/29/2025 3.1.0-dev (Now 4.) main: Xiaomi Mimo model support. Qwen 3 and 3 MoE model support. New arg for quantize(..., calibration_dataset_min_length=10) to filter out bad calibration data that exists in public dataset (wikitext).
04/13/2025 3.0.0: 🎉 New experimental v2 quantization option for improved model quantization accuracy validated by GSM8K_PLATINUM benchmarks vs original gptq. New Phi4-MultiModal model support. New Nvidia Nemotron-Ultra model support. New Dream model support. New experimental multi-GPU quantization support. Reduced VRAM usage. Faster quantization.
04/2/2025 2.2.0: New Qwen 2.5 VL model support. New samples log column during quantization to track module activation in MoE models. Loss log column now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization now streamed to log file. Auto bfloat16 dtype loading for models based on model config. Fixed kernel compile for PyTorch/ROCm. Slightly faster quantization and auto-resolve some low-level OOM issues for smaller VRAM GPUs.
03/12/2025 2.1.0: ✨ New QQQ quantization method and inference support! New Google Gemma 3 zero-day model support. New Alibaba Ovis 2 VL model support. New AMD Instella zero-day model support. New GSM8K Platinum and MMLU-Pro benchmarking support. Peft Lora training with GPT-QModel is now 30%+ faster on all GPU and IPEX devices. Auto-detect MoE modules not activated during quantization due to insufficient calibration data. ROCm setup.py compatibility fixes. Optimum and Peft compatibility fixes. Fixed Peft bfloat16 training.
03/03/2025 2.0.0: 🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion. Synced Marlin kernel inference quality fix from upstream. Added reduced-precision Marlin accumulation mode via environment control (GPTQMODEL_MARLIN_USE_FP32=0 disables it, default is enabled). ModelScope support added. Logging and CLI progress bar output has been revamped with sticky bottom progress. Fixed generation_config.json save and load. Fixed Transformers v4.49.0 compatibility. Fixed compatibility of models without bos. Fixed group_size=-1 and bits=3 packing regression. Fixed Qwen 2.5 MoE regressions. Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate logging/progress bar to LogBar package. Fixed ROCm version auto-detection in setup install.
02/12/2025 1.9.0: ⚡ Offload tokenizer fixes to Toke(n)icer package. Optimized lm_head quant time and VRAM usage. Optimized DeepSeek v3/R1 model quant VRAM usage. Fixed Optimum compatibility regression in v1.8.1. 3x speed-up for Torch kernel when using PyTorch >= 2.5.0 with model.optimize(). New calibration_dataset_concat_size option to enable calibration data concat mode to mimic original GPTQ data packing strategy which may improve quant speed and accuracy for datasets like wikitext2.
02/08/2025 1.8.1: ⚡ DeepSeek v3/R1 model support. New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes.
Triton and Torch kernels support full range of new QuantizeConfig.pack_dtype.
New auto_gc: bool control in quantize() which can reduce quantization time for small models with no chance of OOM.
New buffered_fwd: bool control in model.quantize(). Over 50% quantization speed-up for visual (vl) models.
Fixed bits=3 packing and group_size=-1 regression in v1.7.4.
01/26/2025 1.7.4: New compile() API for ~4-8% inference TPS improvement. Faster pack() for post-quantization model save. Triton kernel validated for Intel/XPU when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save.
01/20/2025 1.7.3: New Telechat2 (China Telecom) and PhiMoE model support. Fixed lm_head weights duplicated in post-quantize save() for models with tied-embedding.
01/19/2025 1.7.2: Effective BPW (bits per weight) will now be logged during load(). Reduce loading time on Intel Arc A770/B580 XPU by 3.3x. Reduce memory usage in MLX conversion and fix Marlin kernel auto-select not checking CUDA compute version.
01/17/2025 1.7.0: 👀 ✨ backend.MLX added for runtime-conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+). ✨ lm_head quantization now fully supported by GPT-QModel without external pkg dependency.
01/07/2025 1.6.1: 🎉 New OpenAI API compatible endpoint via model.serve(host, port). Auto-enable flash-attention2 for inference. Fixed sym=False loading regression.
01/06/2025 1.6.0: ⚡25% faster quantization. 35% reduction in VRAM usage vs v1.5. 👀 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU. Auto-tokenizer loader via load() API. For most models you no longer need to manually init a tokenizer for both inference and quantization.
01/01/2025 1.5.1: 🎉 2025! Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on CPU by-default and each layer is moved to QuantizeConfig.device during quantization to minimize VRAM usage. Compatibility fixes for attn_implementation_autoset in latest transformers.
12/23/2024 1.5.0: Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
12/19/2024 1.4.5: Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed dynamic loading. Reduced quantization VRAM usage.
12/15/2024 1.4.2: macOS GPU (Metal) and CPU (M+) support added/validated for inference and quantization. Cohere 2 model support added.
12/13/2024 1.4.1: Added Qwen2-VL model support. mse quantization control exposed in QuantizeConfig. Monkey patch patch_vllm() and patch_hf() API added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPT-QModel quantized models while upstream PRs are in pending status.
12/10/2024 1.4.0 EvalPlus harness integration merged upstream. We now support both the legacy evaluation harness and EvalPlus. Added pure torch Torch kernel. Refactored Cuda kernel to be DynamicCuda kernel. Triton kernel now auto-padded for max model support. Dynamic quantization now supports both positive (+:, the default) and negative (-:) matching; negative matching allows matched modules to be skipped entirely for quantization. Fixed auto-Marlin kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for transformers/optimum/peft upstream PR merge. Deprecated the saving of Marlin weight format since Marlin supports auto conversion of gptq format to Marlin during runtime.
11/29/2024 1.3.1 Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compatibility fix due to API deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on triton package.
11/26/2024 1.3.0 Zero-Day Hymba model support. Removed tqdm and rogue dependency.
11/24/2024 1.2.3 HF GLM model support. ClearML logging integration. Use device-smi and replace gputil + psutil dependencies. Fixed model unit tests.
11/11/2024 🚀 1.2.1 Meta MobileLLM model support added. Legacy evaluation integration merged upstream. Intel/IPEX CPU inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compatibility with latest transformers. New .load() and .save() API.
10/29/2024 🚀 1.1.0 IBM Granite model support. Full auto-buildless wheel install from PyPI. Reduce max CPU memory usage by >20% during quantization. 100% CI model/feature coverage.
10/12/2024 ✨ 1.0.9 Move AutoRound to optional and fix pip install regression in v1.0.8.
10/11/2024 ✨ 1.0.8 Add wheel for Python 3.12 and CUDA 11.8.
10/08/2024 ✨ 1.0.7 Fixed Marlin (faster) kernel not being auto-selected for some models.
09/26/2024 ✨ 1.0.6 Fixed Llama 3.2 vision quantized loader.
09/26/2024 ✨ 1.0.5 Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
09/26/2024 ✨ 1.0.4 Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing.
09/18/2024 ✨ 1.0.3 Added Microsoft GRIN-MoE and MiniCPM3 support.
08/16/2024 ✨ 1.0.2 Support Intel/AutoRound v0.3, prebuilt whl packages, and PyPI release.
08/14/2024 ✨ 1.0.0 40% faster packing, fixed Python 3.9 compatibility, added evaluation API.
08/10/2024 🚀 0.9.11 Added LG EXAONE 3.0 model support. New dynamic per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to backend.BITBLAS. Auto-heal quantization errors due to small damp values.
07/31/2024 🚀 0.9.10 Ported vllm/nm gptq_marlin inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with FORMAT.GPTQ. Auto-calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
07/25/2024 🚀 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compatibility for vLLM/SGLang, and more.
07/13/2024 🚀 0.9.8: Run quantized models directly using GPT-QModel with fast vLLM or SGLang backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum TPS (check usage under examples). Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
07/08/2024 🚀 0.9.7: InternLM 2.5 model support added.
07/08/2024 🚀 0.9.6: Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization with lm_head module quantization support for even more VRAM reduction: format export to FORMAT.GPTQ for max inference compatibility.
07/05/2024 🚀 0.9.5: CUDA kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.
07/03/2024 🚀 0.9.4: HF Transformers integration added and Gemma 2 support bugs fixed.
07/02/2024 🚀 0.9.3: Added Gemma 2 support, faster quality/benchmark calculations on GPU, and more code/arg refactor.
06/30/2024 🚀 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
06/29/2024 🚀 0.9.1: 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of calibration dataset resulting in > 50% quantization speedup, security hash check of loaded model weights, tons of refactor/usability improvements, bug fixes, and much more.
06/20/2024 ✨ 0.9.0: Thanks to the ModelCloud team and the open-source ML community for their contributions!
PrismAI/Bonsai inference sample script. GPT-QModel loads Prism/Bonsai GGUF checkpoints through its native GGUF loading path and internal GGUF runtime shim. No external gguf PyPI package is required.
GPT-QModel is a production-ready LLM model compression/quantization toolkit with hw-accelerated inference support for both CPU/GPU via HF Transformers, vLLM, and SGLang.
GPT-QModel currently supports GPTQ, AWQ, ParoQuant, QQQ, GGUF, FP8, EXL3, GPTAQ, EoRA, GAR and FOEM, with more quantization methods and enhancements planned.
GPT-QModel has a modular design that supports multiple quantization methods and feature extensions.
| GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
| AWQ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ParoQuant | ✅ | x | x | x | ✅ |
| GGUF | ✅ | x | x | x | x |
| FP8 | ✅ | x | x | x | x |
| Exllama V3 / EXL3 | ✅ | x | x | x | x |
| EoRA | ✅ | ✅ | ✅ | ✅ | x |
| Group Aware Act Reordering | ✅ | ✅ | ✅ | ✅ | ✅ |
| QQQ | ✅ | x | x | x | x |
| Rotation | ✅ | x | x | x | x |
| GPTAQ | ✅ | ✅ | ✅ | ✅ | ✅ |
| FOEM | ✅ | ✅ | ✅ | ✅ | ✅ |
GGUF, FP8, EXL3, and ParoQuant are currently native GPT-QModel quantization/runtime paths. vLLM and SGLang integration currently targets GPTQ and AWQ.
Canonical backend names are shown below. Method-specific aliases are only accepted where explicitly implemented by that quant method.
| METHOD.GPTQ | FORMAT.GPTQ, FORMAT.GPTQ_V2, FORMAT.MARLIN, FORMAT.BITBLAS | FORMAT.GPTQ: BACKEND.GPTQ_TORCH_ATEN, BACKEND.GPTQ_MACHETE, BACKEND.GPTQ_MARLIN, BACKEND.GPTQ_EXLLAMA_V2, BACKEND.GPTQ_TORCH_FUSED, BACKEND.GPTQ_TRITON, BACKEND.GPTQ_BITBLAS, BACKEND.GPTQ_TORCH, BACKEND.GPTQ_TORCH_INT8; FORMAT.GPTQ_V2: BACKEND.GPTQ_TORCH_ATEN, BACKEND.GPTQ_EXLLAMA_V2, BACKEND.GPTQ_TORCH_FUSED, BACKEND.GPTQ_TRITON, BACKEND.GPTQ_BITBLAS, BACKEND.GPTQ_TORCH, BACKEND.GPTQ_TORCH_INT8; FORMAT.MARLIN: BACKEND.GPTQ_MARLIN; FORMAT.BITBLAS: BACKEND.GPTQ_BITBLAS |
| METHOD.AWQ | FORMAT.GEMM, FORMAT.GEMV, FORMAT.GEMV_FAST, FORMAT.LLM_AWQ, FORMAT.MARLIN, FORMAT.BITBLAS | FORMAT.GEMM: BACKEND.AWQ_TORCH_ATEN, BACKEND.AWQ_MACHETE, BACKEND.AWQ_MARLIN, BACKEND.AWQ_EXLLAMA_V2, BACKEND.AWQ_GEMM, BACKEND.AWQ_GEMM_TRITON, BACKEND.AWQ_TORCH_FUSED, BACKEND.AWQ_TORCH, BACKEND.AWQ_TORCH_INT8, BACKEND.AWQ_BITBLAS; FORMAT.GEMV: BACKEND.AWQ_GEMV; FORMAT.GEMV_FAST: BACKEND.AWQ_GEMV_FAST; FORMAT.LLM_AWQ: BACKEND.AWQ_GEMV_FAST; FORMAT.MARLIN: BACKEND.AWQ_MACHETE, BACKEND.AWQ_MARLIN; FORMAT.BITBLAS: BACKEND.AWQ_BITBLAS |
| METHOD.PARO | FORMAT.PAROQUANT | BACKEND.PAROQUANT_CUDA, BACKEND.PAROQUANT_TRITON |
| METHOD.QQQ | FORMAT.QQQ | BACKEND.QQQ, BACKEND.QQQ_TORCH |
| METHOD.GGUF | FORMAT.GGUF | BACKEND.GGUF_TRITON, BACKEND.GGUF_CPP_CUDA, BACKEND.GGUF_CPP_CPU, BACKEND.GGUF_TORCH |
| METHOD.FP8 | FORMAT.FP8 | BACKEND.FP8_TORCH |
| METHOD.BITSANDBYTES | FORMAT.BITSANDBYTES | BACKEND.BITSANDBYTES |
| METHOD.EXL3 | FORMAT.EXL3 | BACKEND.EXL3_EXLLAMA_V3, BACKEND.EXL3_TORCH |
BACKEND.VLLM, BACKEND.SGLANG, and BACKEND.MLX are external runtime backends and are not part of the native kernel matrix above.
Marlin uses GPTQMODEL_MARLIN_USE_FP32 (default: enabled) to control fp32 accumulation.
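A minimal sketch of toggling this flag from Python. The variable name and its enabled-by-default behavior come from the notes above; the `marlin_fp32_enabled` helper and the set of values it treats as false are assumptions for illustration, not GPT-QModel code.

```python
import os

FLAG = "GPTQMODEL_MARLIN_USE_FP32"


def marlin_fp32_enabled(environ=None) -> bool:
    """Hypothetical helper: read the flag, treating an unset variable as enabled."""
    env = os.environ if environ is None else environ
    raw = env.get(FLAG, "1")
    # Assumption: "0" (and common false-like spellings) disable fp32 accumulation.
    return raw.strip().lower() not in {"0", "false", "off", "no"}


# Switch to the reduced-precision accumulation mode.
# Setting the variable before importing gptqmodel is the conservative choice,
# since the point at which the library reads it is not stated here.
os.environ[FLAG] = "0"
print(marlin_fp32_enabled())  # False
```

Leaving the variable unset keeps fp32 accumulation on, which matches the documented default.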
Selected public references where teams or companies explicitly mention GPT-QModel in documentation, integration notes, or quantized model usage. This is not an exhaustive customer list.
🤗 ModelCloud quantized Vortex models on HF
| Apertus | ✅ | EXAONE 3/4 | ✅ | Dots1 | ✅ | Mistral3 | ✅ | Qwen 2/3/3.5 (Next/MoE) | ✅ |
| Baichuan | ✅ | Falcon (H1 / Mamba) | ✅ | InternLM 1/2/2.5 | ✅ | Mixtral | ✅ | Qwen 2/2.5/3 VL | ✅ |
| Bloom | ✅ | FastVLM | ✅ | Kimi K2 | ✅ | MobileLLM | ✅ | Qwen 2.5/3 Omni | ✅ |
| ChatGLM | ✅ | Gemma 1-4 / 3n | ✅ | Klear | ✅ | MOSS | ✅ | RefinedWeb | ✅ |
| CodeGen | ✅ | GPTBigCode | ✅ | LING/RING | ✅ | MPT | ✅ | StableLM | ✅ |
| Cohere 1-2 | ✅ | GPT-Neo / NeoX | ✅ | Llama 1-3.3 | ✅ | Nemotron H / Omni | ✅ | StarCoder2 | ✅ |
| DBRX Converted | ✅ | GPT-2 | ✅ | Llama 3.2 VL | ✅ | Nemotron Ultra / Labs-Diffusion | ✅ | TeleChat2 | ✅ |
| Deci | ✅ | GPT-J | ✅ | Llama 4 | ✅ | OPT | ✅ | Trinity | ✅ |
| DeepSeek-V2/V3/V4/R1 | ✅ | GPT-OSS | ✅ | LongCat Flash | ✅ | OLMo2 / LLaDA2 | ✅ | Yi | ✅ |
| DeepSeek-V2-Lite | ✅ | Granite / Granite MoE | ✅ | LongLLaMA | ✅ | Ovis 1.6/2/2.5/2.6 MoE/2.6 Next | ✅ | Seed-OSS | ✅ |
| Dream | ✅ | GRIN-MoE | ✅ | Instella | ✅ | Phi 1-4 | ✅ | Voxtral | ✅ |
| ERNIE 4.5 / MoE / VL MoE | ✅ | GLM 4/4V/4.5V/4.6V/5/5.1/OCR/ASR | ✅ | GLM4 MoE / Lite / 4.5V MoE | ✅ | MiniCPM 3/O/V/V 4_6 | ✅ | PanGu-α | ✅ |
| XVERSE | ✅ | Brumby | ✅ | Hymba | ✅ | Mistral | ✅ | Qwen 1/2/3/3.5 | ✅ |
| MiniMax M2 | ✅ | AfMoE | ✅ | Bailing-MoE | ✅ | LFM2-MoE | ✅ | Marin | ✅ |
| InternVL Chat | ✅ | Laguna | ✅ | Mimo / Mimo V2 | ✅ | Zamba / Zamba2 | ✅ | Intern S1 | ✅ |
Prism Bonsai GGUF checkpoints are supported for inference only through GPT-QModel's native GGUF path and internal GGUF runtime. Bonsai checkpoints load through the normal model path or repo argument and do not require the external gguf package. Prism model quantization is not included.
GPT-QModel is validated on Linux, macOS, and Windows 11:
| 🐧 Linux | NVIDIA GPU | ✅ | Turing+ (sm_75+) | Machete, Marlin, Exllama V3 / EXL3, Exllama V2, AWQ GEMM/GEMV, ParoQuant CUDA/Triton, GGUF CUDA/Triton, QQQ, BitBLAS, Triton, BitsAndBytes, Torch |
| 🐧 Linux | AMD GPU | ✅ | 7900XT+, ROCm 6.2+ | Exllama V2, AWQ GEMM/GEMV, QQQ, FP8 Torch, Torch |
| 🐧 Linux | Huawei Ascend NPU | ✅ | Ascend 910B, torch-npu / CANN | Native Torch kernels for GPTQ, AWQ, ParoQuant, GGUF, QQQ, and EXL3 |
| 🐧 Linux | Intel XPU | ✅ | Arc, Datacenter Max | TorchFused, TorchFusedAWQ, FP8 Torch, Torch |
| 🐧 Linux | Intel/AMD CPU | ✅ | avx, amx | TorchFused, TorchFusedAWQ, TorchAten int4, TorchInt8, GGUF C++, BitsAndBytes, Torch |
| 🍎 macOS | GPU (Metal) / CPU | ✅ | Apple Silicon, M1+ | Torch, FP8 Torch, MLX via conversion |
| 🪟 Windows | GPU (NVIDIA) / CPU | ✅ | NVIDIA | Torch |
Marlin and JIT CUDA kernels now support NVIDIA Turing+ (sm_75+) GPUs. Huawei Ascend NPU support uses native Torch kernels through torch-npu / CANN.
The package depends on ninja for first-use JIT kernel compilation.
Three-line API to use GPT-QModel for GPTQ model inference:
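A sketch of that flow, wrapped in a function so that the gptqmodel import is deferred until the function is called. `GPTQModel.load`, `generate`, and `tokenizer.decode` follow the load() API and auto-tokenizer behavior described in the release notes above; the function name is illustrative, and the caller is assumed to pass the id or path of an already GPTQ-quantized checkpoint.

```python
def generate_text(model_id: str, prompt: str) -> str:
    """Load a quantized checkpoint and return one decoded completion."""
    # Deferred import: gptqmodel (and a supported device) is only needed at call time.
    from gptqmodel import GPTQModel

    model = GPTQModel.load(model_id)           # tokenizer is loaded automatically
    tokens = model.generate(prompt)[0]         # first sequence of generated token ids
    return model.tokenizer.decode(tokens)
```

Calling `generate_text("<your-quantized-model>", "Uncovering deep insights begins with")` would then return the decoded text; the model id here is a placeholder.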
To use models from ModelScope instead of HuggingFace Hub, set an environment variable:
Some AWQ and ParoQuant CUDA/Triton kernels support an fp32 accumulation mode to reduce numerical drift during fused quantized matmul. This setting defaults to True because accuracy is prioritized over speed.
Notes:
Basic example of using GPT-QModel to quantize an LLM model:
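A minimal sketch of the quantization flow, assuming 4-bit GPTQ with a group size of 128 and a calibration set supplied as a list of text strings. The function name and its parameters are illustrative; the gptqmodel import is deferred so the definition itself has no dependencies.

```python
def quantize_and_save(model_id: str, calibration_texts: list, out_dir: str) -> str:
    """Quantize a model with GPTQ and write the result to out_dir."""
    # Deferred import: gptqmodel is only required when the function runs.
    from gptqmodel import GPTQModel, QuantizeConfig

    quant_config = QuantizeConfig(bits=4, group_size=128)
    model = GPTQModel.load(model_id, quant_config)

    # Increase batch_size to match available VRAM and speed up quantization.
    model.quantize(calibration_texts, batch_size=1)

    model.save(out_dir)
    return out_dir
```

The calibration texts should resemble the data the model will see at inference time; a few hundred to a thousand samples is a common starting point rather than a fixed requirement.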
QuantizeConfig remains the broad factory. The concrete config classes are now GPTQConfig, AWQConfig, ParoConfig, QQQConfig, RTNConfig, GGUFConfig, FP8Config, BitsAndBytesConfig, and EXL3Config.
GPTQ, AWQ, ParoQuant, and EXL3 are calibration-based. GGUF and FP8 are weight-only and should be quantized with calibration=None.
preprocessors=[...] adds optional module-weight preparation steps before quantization or repacking. They are available on GPTQConfig, AWQConfig, ParoConfig, RTNConfig, GGUFConfig, FP8Config, and BitsAndBytesConfig.
Some MoE (mixture of experts) models have extremely uneven/biased routing (distribution of tokens) to the experts causing some expert modules to receive close-to-zero activated tokens, thus failing to complete calibration-based quantization (GPTQ/AWQ). To better quantize these heavily biased MoE routed modules, GPT-QModel exposes 3 controls:
FailSafe can be combined with ExpertsRoutingOverride. There is no single best way to quantize MoE models, and we recommend that users test all three.
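To make the fallback idea concrete, here is a small pure-Python sketch: experts that saw too few routed tokens are quantized with round-to-nearest (RTN) instead of a calibration-based method. The helper names, the token counts, and the threshold semantics are assumptions for illustration and do not mirror GPT-QModel's internal FailSafe implementation.

```python
def rtn_quantize(weights, bits=4):
    """Asymmetric round-to-nearest: map each weight to the closest of 2**bits levels."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant inputs
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized, scale


def pick_quantizer(tokens_seen, min_tokens):
    """Hypothetical rule: too few routed tokens means no reliable calibration statistics."""
    return "rtn" if tokens_seen < min_tokens else "calibrated"


# Hypothetical per-expert token counts gathered during calibration.
expert_tokens = {"experts.0": 4096, "experts.7": 3}
plan = {name: pick_quantizer(n, min_tokens=64) for name, n in expert_tokens.items()}
print(plan)  # {'experts.0': 'calibrated', 'experts.7': 'rtn'}
```

RTN needs no activation data, which is why it is usable for experts that calibration never reached; the trade-off is that it ignores which weights matter most to the output.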
GPT-QModel supports EoRA, a LoRA method developed by Nvidia that can further improve the accuracy of the quantized model.
Read the gptqmodel/models/llama.py code which explains in detail via comments how the model support is defined. Use it as a guide for PRs to add new models. Most models follow the same pattern.
GPT-QModel evaluation is integrated into Evalution, a modern benchmarking toolkit with 150+ of the world's most widely used benchmark suites. We highly recommend using Evalution to measure accuracy recovery after quantization instead of relying on narrow regression-only language-model metrics.
Below is a short example running gsm8k_platinum through Evalution's native GPT-QModel engine.
QuantizeConfig.dynamic is a dynamic control that allows specific matching modules to be skipped for quantization (negative matching) or have a unique [bits, group_size, sym, desc_act, mse, pack_dtype] property override per matching module vs base QuantizeConfig (positive match with override).
Sample QuantizeConfig.dynamic usage:
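The sketch below shows what a rule set of this shape can look like and how positive and negative matches could be resolved. It uses only Python's `re` module; `resolve_dynamic` is a hypothetical helper, the first-match-wins ordering is an assumption, and the module names are illustrative Llama-style paths. It is not GPT-QModel's matching code.

```python
import re


def resolve_dynamic(module_name, dynamic, base):
    """Return per-module settings, or None when the module is skipped."""
    for pattern, overrides in dynamic.items():
        if pattern.startswith("-:"):
            # Negative match: leave the module unquantized.
            if re.match(pattern[2:], module_name):
                return None
            continue
        # "+:" is optional; a bare pattern is treated as a positive match.
        regex = pattern[2:] if pattern.startswith("+:") else pattern
        if re.match(regex, module_name):
            return {**base, **overrides}
    return dict(base)


base = {"bits": 4, "group_size": 128}
dynamic = {
    r"+:.*\.18\..*gate.*": {"bits": 8, "group_size": 64},  # override gate modules at layer index 18
    r"-:.*\.20\..*": {},                                    # skip every module at layer index 20
}

for name in (
    "model.layers.18.mlp.gate_proj",
    "model.layers.20.self_attn.q_proj",
    "model.layers.3.mlp.up_proj",
):
    print(name, "->", resolve_dynamic(name, dynamic, base))
```

Modules that match no rule keep the base settings, so a rule set only needs to list the exceptions.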
Group Aware Reordering (GAR) is an enhanced activation reordering scheme developed by Intel to improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.
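The constraint can be illustrated with a small pure-Python sketch that builds a permutation from two kinds of moves only: reordering channels inside a group, and reordering whole groups. The sort criteria used here (descending per-channel score, then descending top score per group) are assumptions chosen for illustration and are not taken from the GAR implementation; `group_aware_permutation` is a hypothetical name.

```python
def group_aware_permutation(scores, group_size):
    """Permute channels without ever moving a channel out of its original group."""
    if len(scores) % group_size:
        raise ValueError("channel count must be a multiple of group_size")

    groups = []
    for start in range(0, len(scores), group_size):
        # Local move: reorder channels inside one group by descending score.
        members = sorted(range(start, start + group_size), key=lambda i: -scores[i])
        groups.append(members)

    # Global move: reorder entire groups by their highest-scoring channel.
    groups.sort(key=lambda g: -scores[g[0]])
    return [i for g in groups for i in g]


# Hypothetical per-channel activation magnitudes for 8 channels, 2 groups of 4.
scores = [0.1, 0.9, 0.3, 0.2, 5.0, 4.0, 0.5, 0.7]
perm = group_aware_permutation(scores, group_size=4)
print(perm)  # [4, 5, 7, 6, 1, 2, 3, 0]
```

Because every output block of `group_size` channels comes from a single original group, each block can keep using that group's scale and zero-point, which is the property that avoids extra lookups at inference time.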
How to enable GAR:
Set the act_group_aware parameter to True and disable the default activation reordering by setting desc_act to False in your QuantizeConfig. For example:
Enable GPTAQ quantization by setting gptaq = GPTAQConfig(...).
FOEM (First-order error matters) adds first-order error compensation for GPTQ-style quantization. Enable FOEM by setting foem = FOEMConfig(...).
GPT-QModel has fully supplanted AutoGPTQ and AutoAWQ for HF Transformers/Optimum/Peft integration. Model inference has drop-in support with zero changes.
For model quantization, there are some config changes for AutoAWQ:
Models quantized by GPT-QModel are inference compatible with HF Transformers (minus dynamic), vLLM, and SGLang.
GPT-QModel uses a shared LogBar logger. Set the level once near process startup:
If your script imports multiple Triton users (for example gptqmodel, vllm, and sglang), apply the patch at the very top before other Triton-related imports: