[Docs] Add supported quantization methods to docs (#2135)

Woosuk Kwon 2023-12-15 13:29:22 -08:00 committed by GitHub
parent 0fbfc4b81b
commit b81a6a6bb3
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 4 additions and 2 deletions


@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels
 vLLM is flexible and easy to use with:
@@ -44,7 +45,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA CUDA and AMD ROCm.
+- Support NVIDIA GPUs and AMD GPUs.
 vLLM seamlessly supports many Hugging Face models, including the following architectures:


@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels
 vLLM is flexible and easy to use with:
@@ -39,7 +40,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA CUDA and AMD ROCm.
+* Support NVIDIA GPUs and AMD GPUs.
 For more information, check out the following: