diff --git a/README.md b/README.md
index e4b3b502..4a09e3af 100644
--- a/README.md
+++ b/README.md
@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels
 
 vLLM is flexible and easy to use with:
@@ -44,7 +45,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA CUDA and AMD ROCm.
+- Support NVIDIA GPUs and AMD GPUs.
 
 vLLM seamlessly supports many Hugging Face models, including the following architectures:
 
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 04af0907..46620261 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels
 
 vLLM is flexible and easy to use with:
@@ -39,7 +40,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA CUDA and AMD ROCm.
+* Support NVIDIA GPUs and AMD GPUs.
 
 For more information, check out the following:
 
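For context on the quantization feature documented by this diff, here is a minimal sketch of how it might be exercised through vLLM's offline Python API. It assumes the `LLM` constructor accepts a `quantization` argument and that an AWQ-quantized checkpoint (the model name below is an illustrative placeholder) is available; exact parameter names and supported methods may differ across vLLM versions, so treat this as a sketch rather than the documented usage.

```python
# Hedged sketch: running an AWQ-quantized model with vLLM's offline API.
# The model name and the `quantization` argument are assumptions for
# illustration; consult the vLLM docs for your installed version.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is PagedAttention?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```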