Write README and front page of doc (#147)

Woosuk Kwon 2023-06-18 03:19:38 -07:00 committed by GitHub
parent bf5f121c02
commit dcda03b4cb
9 changed files with 65 additions and 60 deletions

View File

@@ -1,66 +1,54 @@
-# vLLM
-
-## Build from source
-
-```bash
-pip install -r requirements.txt
-pip install -e .  # This may take several minutes.
-```
-
-## Test simple server
-
-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model <your_model>
-
-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model <your_model>
-```
-
-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
-
-## FastAPI server
-
-To start the server:
-```bash
-ray start --head
-python -m vllm.entrypoints.fastapi_server # --model <your_model>
-```
-
-To test the server:
-```bash
-python test_cli_client.py
-```
-
-## Gradio web server
-
-Install the following additional dependencies:
-```bash
-pip install gradio
-```
-
-Start the server:
-```bash
-python -m vllm.http_frontend.fastapi_frontend
-# At another terminal
-python -m vllm.http_frontend.gradio_webserver
-```
-
-## Load LLaMA weights
-
-Since the LLaMA weights are not fully public, we cannot directly download them from HuggingFace. Therefore, you need to follow the process below to load the LLaMA weights.
-
-1. Convert the LLaMA weights to HuggingFace format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-```bash
-python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-```bash
-python simple_server.py --model /output/path/llama-7b
-python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
-```
+# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
+
+| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |
+
+vLLM is a fast and easy-to-use library for LLM inference and serving.
+
+## Latest News 🔥
+
+- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post]().
+
+## Getting Started
+
+Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
+
+- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
+- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
+- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
+
+## Key Features
+
+vLLM comes with many powerful features that include:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+## Performance
+
+vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x in terms of throughput.
+For details, check out our [blog post]().
+
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n1.png" width="45%">
+  <img src="./assets/figures/perf_a100_n1.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 1 output completion. </em>
+</p>
+
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n3.png" width="45%">
+  <img src="./assets/figures/perf_a100_n3.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 3 output completions. </em>
+</p>
+
+## Contributing
+
+We welcome and value any contributions and collaborations.
+Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
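
As context for the "Getting Started" and "Key Features" sections added above, the quickstart that the new README links to comes down to a few lines of Python. The snippet below is a minimal sketch assuming the `LLM` and `SamplingParams` API of this release; the model name and sampling values are illustrative and not part of this diff.

```python
# Minimal offline-inference sketch, assuming the vLLM Python API of this
# release (LLM, SamplingParams); the model name and settings are illustrative.
from vllm import LLM, SamplingParams

# Prompts are batched together, which is where the dynamic batching and the
# PagedAttention-based KV-cache management listed under "Key Features" pay off.
prompts = [
    "Hello, my name is",
    "The future of LLM serving is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any supported HuggingFace model name or local path
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```

The tensor parallelism entry in the feature list would be enabled the same way, for example by passing `tensor_parallel_size=2` to `LLM` on a multi-GPU node; that too is an assumption based on the documented API rather than something shown in this diff.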

Four new binary image files added (the performance figures referenced in the README above, under ./assets/figures/); contents not shown. Sizes: 357 KiB, 343 KiB, 322 KiB, and 338 KiB.

View File

@@ -3,17 +3,20 @@

 Installation
 ============

-vLLM is a Python library that includes some C++ and CUDA code.
-vLLM can run on systems that meet the following requirements:
+vLLM is a Python library that also contains some C++ and CUDA code.
+This additional code requires compilation on the user's machine.
+
+Requirements
+------------

 * OS: Linux
 * Python: 3.8 or higher
 * CUDA: 11.0 -- 11.8
-* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
+* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)

 .. note::
     As of now, vLLM does not support CUDA 12.
-    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
+    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.

 .. tip::
     If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.

@@ -45,7 +48,7 @@ You can install vLLM using pip:

 Build from source
 -----------------

-You can also build and install vLLM from source.
+You can also build and install vLLM from source:

 .. code-block:: console
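
The requirements listed above (Linux, Python 3.8+, CUDA 11.0 to 11.8, compute capability 7.0+) can be sanity-checked before installing. The helper below is a hypothetical sketch using PyTorch, which vLLM depends on; it is not part of the vLLM codebase or this diff.

```python
# Hypothetical pre-install check against the documented requirements; uses
# PyTorch (a vLLM dependency), not anything from the vLLM codebase itself.
import sys

import torch

assert sys.platform.startswith("linux"), "vLLM requires Linux"
assert sys.version_info >= (3, 8), "vLLM requires Python 3.8 or higher"

cuda = torch.version.cuda  # e.g. "11.8"; None for CPU-only builds of PyTorch
assert cuda is not None and cuda.startswith("11."), \
    "vLLM requires CUDA 11.0 -- 11.8 (CUDA 12 is not supported yet)"

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (7, 0), "vLLM requires compute capability 7.0 or higher"

print(f"OK: CUDA {cuda}, compute capability {major}.{minor}")
```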

View File

@@ -1,7 +1,21 @@

 Welcome to vLLM!
 ================

-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+
+Its core features include:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+For more information, please refer to our `blog post <>`_.

 Documentation
 -------------
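
The feature list above ends with an OpenAI-compatible API server. As a rough illustration of what "OpenAI-compatible" means for a client, the sketch below posts a completion request with plain `requests`; the host, port, route, and model name are assumptions for illustration and are not specified anywhere in this diff.

```python
# Hypothetical client for the OpenAI-compatible server named in the feature
# list; the URL, route, and model name are assumptions, not from this diff.
import json

import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed local server address
    json={
        "model": "facebook/opt-125m",   # whatever model the server was started with
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```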

View File

@@ -3,7 +3,7 @@

 Supported Models
 ================

-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.

@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.

   * - :code:`GPTNeoXForCausalLM`
     - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
   * - :code:`LlamaForCausalLM`
-    - LLaMA, Vicuna, Alpaca, Koala
+    - LLaMA, Vicuna, Alpaca, Koala, Guanaco
   * - :code:`OPTForCausalLM`
     - OPT, OPT-IML
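
The table above maps architecture classes to popular checkpoints. In practice the architecture is read from the model's HuggingFace config, so any checkpoint in the table is loaded the same way; the snippet below is an illustrative sketch assuming the `LLM` API, and the model names are examples of the listed architectures rather than content of this diff.

```python
# Illustrative only: vLLM picks the implementation from the architecture in the
# HuggingFace config, so checkpoints from the table load uniformly. The API
# usage is an assumption based on the quickstart, not part of this diff.
from vllm import LLM, SamplingParams

# A GPTNeoXForCausalLM checkpoint from the table; a converted LLaMA checkpoint
# (LlamaForCausalLM) would be passed the same way, e.g. a local path such as
# LLM(model="/output/path/llama-7b").
llm = LLM(model="EleutherAI/pythia-70m")

outputs = llm.generate(
    ["The list of supported architectures keeps growing because"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```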

View File

@@ -165,7 +165,7 @@ setuptools.setup(
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
     packages=setuptools.find_packages(
-        exclude=("benchmarks", "csrc", "docs", "examples", "tests")),
+        exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
     python_requires=">=3.8",
     install_requires=get_requirements(),
     ext_modules=ext_modules,