vllm/docs/source/features/quantization/supported_hardware.md

(quantization-supported-hardware)=
# Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
:::{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
- * Implementation
* Volta
* Turing
* Ampere
* Ada
* Hopper
* AMD GPU
* Intel GPU
* x86 CPU
* AWS Inferentia
* Google TPU
- * AWQ
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
* ✅︎
*
*
- * GPTQ
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
* ✅︎
* ✅︎
*
*
- * Marlin (GPTQ/AWQ/FP8)
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * INT8 (W8A8)
*
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
* ✅︎
*
*
- * FP8 (W8A8)
*
*
*
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
- * AQLM
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * bitsandbytes
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * DeepSpeedFP
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
*
- * GGUF
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
* ✅︎
*
*
*
*
:::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- A blank cell indicates that the quantization method is not supported on the specified hardware.
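For scripting a quick pre-flight check, the NVIDIA GPU columns of the table above can be encoded as plain data. The helper below is a hypothetical sketch, not part of the vLLM API: `NVIDIA_SUPPORT` and `supported_methods` are illustrative names, and the support sets are transcribed directly from the table, keyed by (major, minor) compute capability.

```python
# Hypothetical helper (not part of vLLM): the NVIDIA GPU columns of the
# table above. Volta = (7, 0), Turing = (7, 5), Ampere = (8, 0)/(8, 6),
# Ada = (8, 9), Hopper = (9, 0).
AMPERE_AND_UP = {(8, 0), (8, 6), (8, 9), (9, 0)}
TURING_AND_UP = {(7, 5)} | AMPERE_AND_UP
VOLTA_AND_UP = {(7, 0)} | TURING_AND_UP

NVIDIA_SUPPORT = {
    "AWQ": TURING_AND_UP,
    "GPTQ": VOLTA_AND_UP,
    "Marlin (GPTQ/AWQ/FP8)": AMPERE_AND_UP,
    "INT8 (W8A8)": TURING_AND_UP,
    "FP8 (W8A8)": {(8, 9), (9, 0)},
    "AQLM": VOLTA_AND_UP,
    "bitsandbytes": VOLTA_AND_UP,
    "DeepSpeedFP": VOLTA_AND_UP,
    "GGUF": VOLTA_AND_UP,
}

def supported_methods(capability):
    """Return the methods from the table that support this (major, minor)
    compute capability, e.g. (8, 9) for an Ada GPU."""
    return sorted(m for m, caps in NVIDIA_SUPPORT.items() if capability in caps)

# Per the table, an Ada GPU (SM 8.9) supports FP8 (W8A8), while a
# Turing GPU (SM 7.5) does not.
print(supported_methods((8, 9)))
print(supported_methods((7, 5)))
```

At load time, vLLM normally picks up the quantization method from the model checkpoint's configuration; it can also be set explicitly via the `quantization` argument of `vllm.LLM` (e.g. `quantization="awq"`). The sketch above only mirrors the table and makes no claim about runtime behavior.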
:::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult the vLLM development team.
:::