<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
    <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
  </picture>
</p>

<h3 align="center">
Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> |
</p>

---

**The Second vLLM Bay Area Meetup (Jan 31st 5pm-7:30pm PT)**
We are thrilled to announce our second vLLM Meetup!
The vLLM team will share recent updates and roadmap.
We will also have vLLM collaborators from IBM coming up to the stage to discuss their insights on LLM optimizations.
Please register [here](https://lu.ma/ygxbpzhl) and join us!
---

*Latest News* 🔥

- [2023/12] Added ROCm support to vLLM.
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
- [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
- [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2 models on vLLM with a single command!
- [2023/06] Serving vLLM on any cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the clouds.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629) (see the sketch after this list)
- Optimized CUDA kernels
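
For instance, a quantized checkpoint is selected by passing the `quantization` argument to the `LLM` constructor. A minimal sketch; the AWQ model name below is an assumption, so substitute any AWQ-quantized checkpoint from the Hugging Face Hub:

```python
from vllm import LLM

# Load an AWQ-quantized checkpoint. The model name is an example
# assumption; any AWQ-quantized Hugging Face checkpoint works here.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
```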

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more (see the sketch after this list)
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs and AMD GPUs
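
As a minimal sketch of the decoding options (assuming a small example model such as `facebook/opt-125m`; exact `SamplingParams` fields may vary by vLLM version), parallel sampling and beam search are both selected through `SamplingParams`:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Parallel sampling: generate n independent completions per prompt.
parallel = SamplingParams(n=3, temperature=0.8)

# Beam search: keep the n highest-scoring beams instead of sampling
# (beam search requires temperature=0).
beam = SamplingParams(n=3, use_beam_search=True, temperature=0.0)

for params in (parallel, beam):
    outputs = llm.generate(["The capital of France is"], params)
    for completion in outputs[0].outputs:
        print(completion.text)
```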

vLLM seamlessly supports many Hugging Face models, including the following architectures:

- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan & Baichuan2 (`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
- DeciLM (`Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Phi (`microsoft/phi-1_5`, `microsoft/phi-2`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)

Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```
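
Once installed, offline batched inference takes only a few lines. A minimal sketch, assuming `facebook/opt-125m` as a small example model:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model and generate completions for all prompts in one batch.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```
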
## Getting Started

Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.

- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

## Contributing

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```