[Docs] Update readme (#7316)
parent 6c8e595710
commit f020a6297e

README.md · 19 changed lines
@@ -10,7 +10,7 @@ Easy, fast, and cheap LLM serving for everyone
 </h3>

 <p align="center">
-| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> |
+| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> |

 </p>

@@ -36,10 +36,12 @@ vLLM is fast with:
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629), FP8 KV Cache
-- Optimized CUDA kernels
+- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+- Speculative decoding
+- Chunked prefill

-**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vllm against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)).
+**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)).

 vLLM is flexible and easy to use with:

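The quantization and chunked-prefill bullets added in the hunk above can be exercised through vLLM's offline `LLM` API. A minimal sketch, assuming an AWQ-quantized checkpoint from the Hugging Face Hub; the model name and sampling settings are illustrative and not part of this commit:

```python
from vllm import LLM, SamplingParams

# Assumed AWQ checkpoint; any AWQ-quantized model on the Hub works the same way.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",
    quantization="awq",            # one of the schemes listed in the README bullet
    enable_chunked_prefill=True,   # chunked prefill, as advertised above
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```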
@@ -48,20 +50,21 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism and pipeline parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs
-- (Experimental) Prefix caching support
-- (Experimental) Multi-lora support
+- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
+- Prefix caching support
+- Multi-lora support

 vLLM seamlessly supports most popular open-source models on HuggingFace, including:
 - Transformer-like LLMs (e.g., Llama)
 - Mixture-of-Expert LLMs (e.g., Mixtral)
+- Embedding Models (e.g. E5-Mistral)
 - Multi-modal LLMs (e.g., LLaVA)

 Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).

 ## Getting Started

-Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
+Install vLLM with `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

 ```bash
 pip install vllm
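For the OpenAI-compatible API server bullet in the hunk above, the usual pattern is to start a server and talk to it with the stock `openai` client. A hedged sketch; the served model name is an assumption, and port 8000 is simply the server's default:

```python
# Assumes a server was started beforehand, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct
# which listens on port 8000 by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not check the key

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```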
@@ -31,8 +31,10 @@ vLLM is fast with:
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
 * Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
-* Optimized CUDA kernels
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, INT4, INT8, and FP8
+* Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
+* Speculative decoding
+* Chunked prefill

 vLLM is flexible and easy to use with:

@@ -41,9 +43,9 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism and pipeline parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs
-* (Experimental) Prefix caching support
-* (Experimental) Multi-lora support
+* Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
+* Prefix caching support
+* Multi-lora support

 For more information, check out the following:

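The prefix-caching and multi-LoRA bullets that lose their `(Experimental)` label in the hunk above correspond to engine flags plus a per-request LoRA handle. A minimal sketch, with the base model, adapter name, and adapter path all hypothetical:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # hypothetical base model
    enable_prefix_caching=True,         # reuse KV cache across shared prompt prefixes
    enable_lora=True,                   # allow per-request LoRA adapters
)

# Hypothetical adapter: (name, integer id, local path to the LoRA weights).
lora = LoRARequest("demo-adapter", 1, "/path/to/lora_adapter")

outputs = llm.generate(
    ["You are a helpful assistant. Question: what does vLLM stand for?"],
    SamplingParams(max_tokens=32),
    lora_request=lora,
)
print(outputs[0].outputs[0].text)
```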
@@ -53,7 +55,6 @@ For more information, check out the following:
 * :ref:`vLLM Meetups <meetups>`.


-
 Documentation
 -------------
