Write README and front page of doc (#147)
parent bf5f121c02
commit dcda03b4cb

README.md (90 changed lines)
@@ -1,66 +1,54 @@

-# vLLM
-
-## Build from source
-
-```bash
-pip install -r requirements.txt
-pip install -e .  # This may take several minutes.
-```
+# vLLM: Easy, Fast, and Cheap LLM Serving for Everyone
+
+| [**Documentation**](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) | [**Blog**]() |
+
+vLLM is a fast and easy-to-use library for LLM inference and serving.
-## Test simple server
-
-```bash
-# Single-GPU inference.
-python examples/simple_server.py # --model <your_model>
-
-# Multi-GPU inference (e.g., 2 GPUs).
-ray start --head
-python examples/simple_server.py -tp 2 # --model <your_model>
-```
-
-The detailed arguments for `simple_server.py` can be found by:
-```bash
-python examples/simple_server.py --help
-```
+## Latest News 🔥
+
+- [2023/06] We officially released vLLM! vLLM has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post]().
+
+## Getting Started
+
+Visit our [documentation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/) to get started.
+
+- [Installation](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/installation.html): `pip install vllm`
+- [Quickstart](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/getting_started/quickstart.html)
+- [Supported Models](https://llm-serving-cacheflow.readthedocs-hosted.com/_/sharing/Cyo52MQgyoAWRQ79XA4iA2k8euwzzmjY?next=/en/latest/models/supported_models.html)
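
As a concrete illustration of the quickstart linked above, offline batched inference with the released `vllm` package looks roughly like the sketch below; the model name and sampling settings are only examples, not values taken from this commit:

```python
from vllm import LLM, SamplingParams

# Example prompts to complete in a single batch.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Illustrative sampling settings.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small HuggingFace model; weights are downloaded on first use.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts; requests are batched dynamically.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```
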
-## FastAPI server
-
-To start the server:
-```bash
-ray start --head
-python -m vllm.entrypoints.fastapi_server # --model <your_model>
-```
-
-To test the server:
-```bash
-python test_cli_client.py
-```
+## Key Features
+
+vLLM comes with many powerful features that include:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
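
The last bullet, the OpenAI-compatible API server, can be exercised with any OpenAI-style client. The sketch below uses plain `requests` and assumes a vLLM server is already running locally on port 8000; the host, port, and model name are assumptions for illustration, not taken from this commit:

```python
import requests

# Assumed local endpoint of a running vLLM OpenAI-compatible server.
API_URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "facebook/opt-125m",   # whichever model the server was launched with
    "prompt": "San Francisco is a",
    "max_tokens": 32,
    "temperature": 0.7,
}

# The response follows the OpenAI completions schema.
response = requests.post(API_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```
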
-## Gradio web server
-
-Install the following additional dependencies:
-```bash
-pip install gradio
-```
-
-Start the server:
-```bash
-python -m vllm.http_frontend.fastapi_frontend
-# At another terminal
-python -m vllm.http_frontend.gradio_webserver
-```
+## Performance
+
+vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x in terms of throughput. For details, check out our [blog post]().
+
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n1.png" width="45%">
+  <img src="./assets/figures/perf_a100_n1.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 1 output completion. </em>
+</p>
-## Load LLaMA weights
-
-Since the LLaMA weights are not fully public, we cannot download them directly from HuggingFace. Therefore, you need to follow the process below to load the LLaMA weights.
-
-1. Convert the LLaMA weights to HuggingFace format with [this script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py).
-    ```bash
-    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
-        --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path/llama-7b
-    ```
-2. For all the commands above, specify the model with `--model /output/path/llama-7b` to load the model. For example:
-    ```bash
-    python simple_server.py --model /output/path/llama-7b
-    python -m vllm.http_frontend.fastapi_frontend --model /output/path/llama-7b
-    ```
+<p align="center">
+  <img src="./assets/figures/perf_a10g_n3.png" width="45%">
+  <img src="./assets/figures/perf_a100_n3.png" width="45%">
+  <br>
+  <em> Serving throughput when each request asks for 3 output completions. </em>
+</p>
+
+## Contributing
+
+We welcome and value any contributions and collaborations. Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
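
For completeness, once LLaMA weights have been converted to HuggingFace format as in the removed section above, the new `vllm` package can also load them from the local directory; a minimal sketch that reuses the example path from that section (the tensor-parallel setting is shown only as an option for multi-GPU machines):

```python
from vllm import LLM, SamplingParams

# Point vLLM at the locally converted, HuggingFace-format LLaMA checkpoint.
# tensor_parallel_size > 1 shards the model across multiple GPUs.
llm = LLM(model="/output/path/llama-7b", tensor_parallel_size=1)

outputs = llm.generate(["The meaning of life is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
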

New binary files added:

- assets/figures/perf_a100_n1.png (357 KiB)
- assets/figures/perf_a100_n3.png (343 KiB)
- assets/figures/perf_a10g_n1.png (322 KiB)
- assets/figures/perf_a10g_n3.png (338 KiB)

@@ -3,17 +3,20 @@

 Installation
 ============

-vLLM is a Python library that includes some C++ and CUDA code.
-vLLM can run on systems that meet the following requirements:
+vLLM is a Python library that also contains some C++ and CUDA code.
+This additional code requires compilation on the user's machine.
+
+Requirements
+------------

 * OS: Linux
 * Python: 3.8 or higher
 * CUDA: 11.0 -- 11.8
-* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, etc.)
+* GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, etc.)

 .. note::
     As of now, vLLM does not support CUDA 12.
-    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8.
+    If you are using Hopper or Lovelace GPUs, please use CUDA 11.8 instead of CUDA 12.

 .. tip::
     If you have trouble installing vLLM, we recommend using the NVIDIA PyTorch Docker image.
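
To check whether a machine meets the GPU and CUDA requirements listed above, a quick probe via PyTorch works; this assumes `torch` is already installed, which the vLLM install pulls in anyway:

```python
import torch

# CUDA toolkit version that this PyTorch build was compiled against
# (should be in the 11.0 -- 11.8 range for vLLM).
print("CUDA version:", torch.version.cuda)

# Compute capability of GPU 0 (vLLM needs 7.0 or higher, e.g. V100, T4, A100).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("GPU:", torch.cuda.get_device_name(0))
```
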

@@ -45,7 +48,7 @@ You can install vLLM using pip:

 Build from source
 -----------------

-You can also build and install vLLM from source.
+You can also build and install vLLM from source:

 .. code-block:: console

@@ -1,7 +1,21 @@

 Welcome to vLLM!
 ================

-vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLM).
+**vLLM** is a fast and easy-to-use library for LLM inference and serving.
+
+Its core features include:
+
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with **PagedAttention**
+- Seamless integration with popular HuggingFace models
+- Dynamic batching of incoming requests
+- Optimized CUDA kernels
+- High-throughput serving with various decoding algorithms, including *parallel sampling* and *beam search*
+- Tensor parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+
+For more information, please refer to our `blog post <>`_.

 Documentation
 -------------

@@ -3,7 +3,7 @@

 Supported Models
 ================

-vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://github.com/huggingface/transformers>`_.
+vLLM supports a variety of generative Transformer models in `HuggingFace Transformers <https://huggingface.co/models>`_.
 The following is the list of model architectures that are currently supported by vLLM.
 Alongside each architecture, we include some popular models that use it.

@@ -18,7 +18,7 @@ Alongside each architecture, we include some popular models that use it.

   * - :code:`GPTNeoXForCausalLM`
     - GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
   * - :code:`LlamaForCausalLM`
-    - LLaMA, Vicuna, Alpaca, Koala
+    - LLaMA, Vicuna, Alpaca, Koala, Guanaco
   * - :code:`OPTForCausalLM`
     - OPT, OPT-IML
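
Any of the architectures listed above is used by passing the corresponding HuggingFace model name (or a local HuggingFace-format checkpoint directory) to vLLM. The model IDs below are common examples of the listed families, chosen for illustration rather than taken from this commit:

```python
from vllm import LLM, SamplingParams

# A GPT-NeoX-family model (Pythia) loaded directly by its HuggingFace ID.
llm = LLM(model="EleutherAI/pythia-160m")

# tensor_parallel_size can be raised to shard larger models across GPUs,
# e.g. LLM(model="facebook/opt-13b", tensor_parallel_size=2).
params = SamplingParams(temperature=0.0, max_tokens=16)
print(llm.generate(["Deep learning is"], params)[0].outputs[0].text)
```
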

setup.py (2 changed lines)

@@ -165,7 +165,7 @@ setuptools.setup(

         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
     packages=setuptools.find_packages(
-        exclude=("benchmarks", "csrc", "docs", "examples", "tests")),
+        exclude=("assets", "benchmarks", "csrc", "docs", "examples", "tests")),
     python_requires=">=3.8",
     install_requires=get_requirements(),
     ext_modules=ext_modules,