[Docs] Add CUDA graph support to docs (#2148)
parent c3372e87be
commit 26c52a5ea6
@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels
@@ -45,7 +46,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
--  Support NVIDIA GPUs and AMD GPUs.
+-  Support NVIDIA GPUs and AMD GPUs

 vLLM seamlessly supports many Hugging Face models, including the following architectures:
@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Fast model execution with CUDA/HIP graph
 * Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels
@@ -40,7 +41,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs.
+* Support NVIDIA GPUs and AMD GPUs

 For more information, check out the following:
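Background on the new bullet: a CUDA/HIP graph records a sequence of GPU kernel launches once and then replays them with near-zero CPU launch overhead, which pays off for the small, repetitive forward passes of token-by-token decoding. A minimal sketch of the mechanism using PyTorch's public CUDA graph API; the `Linear` model and the tensor shapes here are illustrative stand-ins, not vLLM's actual capture code:

```python
import torch

# Illustrative stand-in for a decoding step; not vLLM's model runner.
model = torch.nn.Linear(4096, 4096).cuda()
static_input = torch.randn(1, 4096, device="cuda")

# Warm up on a side stream so capture sees a steady-state allocator.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Update the captured input buffer in place, then replay the
# recorded kernels without re-dispatching them from Python.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```

In vLLM itself, graph capture happens inside the model runner and is on by default after this change; passing `enforce_eager=True` to the `LLM` constructor (or engine args) disables it and falls back to eager PyTorch execution.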