BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$. BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{UINT4}A_{FP16}$ in GPTQ, the $W_{INT2}A_{FP16}$ in BitDistiller, the $W_{INT2}A_{INT8}$ in BitNet-b1.58. BitBLAS is based on techniques from our paper "Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation" at OSDI'24.
Some of the key features of BitBLAS include:
BitBLAS achieves exceptional performance across a variety of computational patterns. Below are selected results showcasing its capabilities:
End2End Integration with Quantize Inference Kernel for AutoGPTQ and vLLM.
Weight Only Matmul performance on A100
TensorCore FP16/INT8 GEMM Performance Vs. Vendor Library on A100 and RTX4090
For more detailed information on benchmark sets with other formats (NF4/FP4) and other devices (RTX 3090), please refer to the benchmark.
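| A_dtype | W_dtype | Accum_dtype | Out_dtype | BitBLAS Support | Tested Platform |
|:-------:|:-------:|:-----------:|:---------:|:---------------:|:----------------|
| BF16 | BF16 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |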
| BF16 | FP4_E2M1 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| BF16 | FP8_E4M3 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| BF16 | INT8 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| BF16 | UINT4/INT4 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| BF16 | UINT2/INT2 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| BF16 | UINT1 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| BF16 | NF4 | FP32 | FP16 | √ | A100(SM_80)/A6000(SM_86) |
| FP16 | FP16 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | FP4_E2M1 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | FP8_E4M3 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | INT8 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | UINT4/INT4 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | UINT2/INT2 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | UINT1 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP16 | NF4 | FP32/FP16 | FP16 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | INT8 | INT32 | FP32/INT32/FP16/INT8 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | UINT4/INT4 | INT32 | FP32/INT32/FP16/INT8 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | UINT2/INT2 | INT32 | FP32/INT32/FP16/INT8 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| INT8 | UINT1 | INT32 | FP32/INT32/FP16/INT8 | √ | V100(SM_70)/A100(SM_80)/A6000(SM_86)/RTX 4090(SM_89) |
| FP8_E4M3 | FP8_E4M3 | FP32 | FP32/FP16 | √ | RTX 4090(SM_89) |
| FP8_E5M2 | FP8_E5M2 | FP32 | FP32/FP16 | √ | RTX 4090(SM_89) |
| INT4 | INT4 | INT32 | FP32/FP16 | √ | RTX 4090(SM_89) |
We are continuously expanding the support matrix. If you have any specific requirements, please feel free to open an issue or PR.
Prerequisites for installation via wheel or PyPI
The easiest way to install BitBLAS is directly from PyPI using pip. To install the latest version, run `pip install bitblas` in your terminal.
Alternatively, to install the latest version of BitBLAS from the GitHub repository, run `pip install git+https://github.com/microsoft/BitBLAS.git`.
After installing BitBLAS, you can verify the installation by running `python -c "import bitblas; print(bitblas.__version__)"`.
Note: BitBLAS wheels are currently built on Ubuntu 20.04, so they are supported only on Ubuntu 20.04 or later. Wheels are provided only for CUDA >= 11.0 and Python >= 3.8. If you are using a different platform or environment, you may need to build BitBLAS from source. More installation methods can be found in the installation document.
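The wheel requirements in the note above (Ubuntu 20.04 or later, CUDA >= 11.0, Python >= 3.8) can be checked before installing. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes the CUDA version is read from the `release X.Y` text printed by `nvcc --version` and the distribution from `/etc/os-release`. It is not an official BitBLAS check.

```python
import re
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 8)
MIN_CUDA = (11, 0)
MIN_UBUNTU = (20, 4)


def python_ok(version_info=sys.version_info):
    """True if the interpreter is at least Python 3.8."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_ok(nvcc_output):
    """True if the reported CUDA toolkit release is at least 11.0."""
    release = parse_cuda_release(nvcc_output)
    return release is not None and release >= MIN_CUDA


def ubuntu_ok(os_release_text):
    """True if the os-release text describes Ubuntu 20.04 or later."""
    fields = {}
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    if fields.get("ID") != "ubuntu":
        return False
    try:
        major, minor = (int(p) for p in fields.get("VERSION_ID", "").split("."))
    except ValueError:
        return False
    return (major, minor) >= MIN_UBUNTU


if __name__ == "__main__":
    print("Python >= 3.8:", python_ok())
    nvcc = shutil.which("nvcc")
    if nvcc:
        out = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        print("CUDA >= 11.0:", cuda_ok(out.stdout))
    else:
        print("CUDA >= 11.0: nvcc not found on PATH")
    try:
        with open("/etc/os-release") as handle:
            print("Ubuntu >= 20.04:", ubuntu_ok(handle.read()))
    except OSError:
        print("Ubuntu >= 20.04: /etc/os-release not readable")
```

If any check fails, building from source as described in the installation document is the fallback.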
BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication: bitblas.Matmul and bitblas.Linear.
Here is an example of a $W_{INT4}A_{FP16}$ mixed-precision matrix multiplication: $out_{FP16}[M, N] = A_{FP16}[M, K] \times W_{INT4}[N, K]$. The example creates the input matrices, quantizes the weight matrix, and executes the matrix multiplication with the bitblas.Matmul API. The result is then compared against a reference result computed by conventional means to check accuracy.
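To make the semantics of this computation concrete, the following is a library-independent sketch in plain Python. It does not call BitBLAS and does not reflect how BitBLAS stores or transforms weights internally: the two-values-per-byte, low-nibble-first packing is an assumption chosen for illustration, the function names are hypothetical, and Python floats stand in for FP16. It shows only what $out[M, N] = A[M, K] \times W_{INT4}[N, K]$ means when the weight is stored row-wise as $[N, K]$.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    The packing order is an illustrative assumption, not BitBLAS's layout.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def matmul_nt(a, w):
    """out[m][n] = sum over k of a[m][k] * w[n][k], with W stored as [N, K]."""
    return [[sum(x * y for x, y in zip(row_a, row_w)) for row_w in w]
            for row_a in a]


def matmul_w_int4(a, packed_w):
    """Unpack each packed weight row, then multiply against the activations."""
    w = [unpack_int4(row) for row in packed_w]
    return matmul_nt(a, w)


if __name__ == "__main__":
    a = [[0.5, -1.0, 2.0, 0.25]]            # activations, M=1, K=4
    w = [[1, -2, 3, 7], [-8, 0, 4, -1]]     # int4 weights, N=2, K=4
    packed = [pack_int4(row) for row in w]  # 2 bytes per row instead of 4 values
    out = matmul_w_int4(a, packed)
    assert out == matmul_nt(a, w)           # packed path matches the reference
    print(out)
```

A real kernel fuses the unpacking into the multiplication rather than materializing the full weight matrix; the sketch separates the two steps only to keep the reference comparison easy to read.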
Note: More examples can be found in the QuickStart document.
Installation: The installation document of BitBLAS. Make sure the CUDA toolkit (version >= 11.0) is already installed on the system.
QuickStart: This document provides examples to use BitBLAS in your program with bitblas.Matmul and bitblas.Linear.
Python API: The Python API document of BitBLAS. BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication: bitblas.Matmul and bitblas.Linear.
Integration: Explore how BitBLAS seamlessly integrates with LLM deployment frameworks through our examples. Discover the ease of integrating BitBLAS with PyTorch, AutoGPTQ, and vLLM in the 3rd-party integration examples.
Customization: BitBLAS supports implementing customized mixed-precision DNN operations other than matrix multiplication (e.g., Conv2D) with its flexible DSL (TIR Script).
Please cite BitBLAS/Ladder in your publications if it helps your research:
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.