diff --git a/.gitignore b/.gitignore
index a050864d..da5a337c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -170,3 +170,6 @@ cython_debug/
 
 # Python pickle files
 *.pkl
+
+# Sphinx documentation
+_build/
diff --git a/README.md b/README.md
index 33fbab2b..2a636228 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ vLLM is fast with:
 
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
-- Dynamic batching of incoming requests
+- Continuous batching of incoming requests
 - Optimized CUDA kernels
 
 vLLM is flexible and easy to use with:
diff --git a/docs/source/index.rst b/docs/source/index.rst
index ab2e17a9..6fbfd01d 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -29,7 +29,7 @@ vLLM is fast with:
 
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
-* Dynamic batching of incoming requests
+* Continuous batching of incoming requests
 * Optimized CUDA kernels
 
 vLLM is flexible and easy to use with:
@@ -40,7 +40,11 @@ vLLM is flexible and easy to use with:
 * Streaming outputs
 * OpenAI-compatible API server
 
-For more information, please refer to our `blog post `_.
+For more information, check out the following:
+
+* `vLLM announcement blog post `_ (intro to PagedAttention)
+* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency `_ by Cade Daniel et al.
+
 
 
 Documentation
@@ -53,6 +57,12 @@ Documentation
    getting_started/installation
    getting_started/quickstart
 
+.. toctree::
+   :maxdepth: 1
+   :caption: Serving
+
+   serving/distributed_serving
+
 .. toctree::
    :maxdepth: 1
    :caption: Models
diff --git a/docs/source/serving/distributed_serving.rst b/docs/source/serving/distributed_serving.rst
new file mode 100644
index 00000000..4f36dca1
--- /dev/null
+++ b/docs/source/serving/distributed_serving.rst
@@ -0,0 +1,38 @@
+.. _distributed_serving:
+
+Distributed Inference and Serving
+=================================
+
+vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm `_. We manage the distributed runtime with `Ray `_. To run distributed inference, install Ray with:
+
+.. code-block:: console
+
+    $ pip install ray
+
+To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:
+
+.. code-block:: python
+
+    from vllm import LLM
+    llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
+    output = llm.generate("San Francisco is a")
+
+To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
+
+.. code-block:: console
+
+    $ python -m vllm.entrypoints.api_server \
+    $ --model facebook/opt-13b \
+    $ --tensor-parallel-size 4
+
+To scale vLLM beyond a single machine, start a `Ray runtime `_ via the CLI before running vLLM:
+
+.. code-block:: console
+
+    $ # On head node
+    $ ray start --head
+
+    $ # On worker nodes
+    $ ray start --address=
+
+After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node and setting :code:`tensor_parallel_size` to the total number of GPUs across all machines.
\ No newline at end of file
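
To make the multi-node sizing rule in the last paragraph of the new distributed_serving.rst page concrete, here is a minimal sketch. It assumes a Ray cluster spanning two machines with 4 GPUs each (8 GPUs total) has already been started with the :code:`ray start` commands shown in that page; the node and GPU counts, the prompt, and the sampling settings are illustrative assumptions, not part of the patch.

.. code-block:: python

    # Minimal sketch (illustrative): multi-node tensor-parallel inference.
    # Assumes `ray start --head` was run on this head node and
    # `ray start --address=<head-node-address>` on one worker node,
    # giving a 2-machine Ray cluster with 4 GPUs per machine (8 total).
    from vllm import LLM, SamplingParams

    # tensor_parallel_size is the total number of GPUs across all
    # machines in the Ray cluster, not just the GPUs on the head node.
    llm = LLM("facebook/opt-13b", tensor_parallel_size=8)

    # Illustrative prompt and sampling settings.
    sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
    outputs = llm.generate(["San Francisco is a"], sampling_params)
    print(outputs[0].outputs[0].text)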