[Doc] Documentation for distributed inference (#261)
parent 0b7db411b5
commit 2cf1a333b6
3
.gitignore
vendored
@@ -170,3 +170,6 @@ cython_debug/
 
 # Python pickle files
 *.pkl
+
+# Sphinx documentation
+_build/
@@ -28,7 +28,7 @@ vLLM is fast with:
 
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
-- Dynamic batching of incoming requests
+- Continuous batching of incoming requests
 - Optimized CUDA kernels
 
 vLLM is flexible and easy to use with:
@@ -29,7 +29,7 @@ vLLM is fast with:
 
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
-* Dynamic batching of incoming requests
+* Continuous batching of incoming requests
 * Optimized CUDA kernels
 
 vLLM is flexible and easy to use with:
@@ -40,7 +40,11 @@ vLLM is flexible and easy to use with:
 * Streaming outputs
 * OpenAI-compatible API server
 
-For more information, please refer to our `blog post <https://vllm.ai>`_.
+For more information, check out the following:
+
+* `vLLM announcing blog post <https://vllm.ai>`_ (intro to PagedAttention)
+* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
+
 
 
 Documentation
@@ -53,6 +57,12 @@ Documentation
    getting_started/installation
    getting_started/quickstart
 
+.. toctree::
+   :maxdepth: 1
+   :caption: Serving
+
+   serving/distributed_serving
+
 .. toctree::
    :maxdepth: 1
    :caption: Models
38
docs/source/serving/distributed_serving.rst
Normal file
@@ -0,0 +1,38 @@
.. _distributed_serving:

Distributed Inference and Serving
=================================

vLLM supports distributed tensor-parallel inference and serving. Currently, we support `Megatron-LM's tensor parallel algorithm <https://arxiv.org/pdf/1909.08053.pdf>`_. We manage the distributed runtime with `Ray <https://github.com/ray-project/ray>`_. To run distributed inference, install Ray with:

.. code-block:: console

    $ pip install ray

To run multi-GPU inference with the :code:`LLM` class, set the :code:`tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 4 GPUs:

.. code-block:: python

    from vllm import LLM
    llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
    output = llm.generate("San Francisco is a")
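
A minimal sketch of reading the result of the :code:`LLM` example above. It assumes, as in vLLM's quickstart, that :code:`llm.generate` returns a list of request outputs whose :code:`.prompt` and :code:`.outputs[0].text` fields hold the prompt and the first completion. The helper names :code:`collect_texts` and :code:`run_tensor_parallel_demo` are hypothetical, and the vLLM call is wrapped in a function so that nothing touches the GPUs on import.

```python
from typing import Iterable, List, Tuple


def collect_texts(request_outputs: Iterable) -> List[Tuple[str, str]]:
    """Pair each prompt with the text of its first completion.

    Assumption: every item exposes `.prompt` and `.outputs[0].text`,
    the shape used in vLLM's quickstart examples.
    """
    return [(out.prompt, out.outputs[0].text) for out in request_outputs]


def run_tensor_parallel_demo(num_gpus: int = 4) -> List[Tuple[str, str]]:
    """Run the multi-GPU example from the text; needs vLLM, Ray, and GPUs."""
    # Imported lazily so collect_texts stays usable without vLLM installed.
    from vllm import LLM

    llm = LLM("facebook/opt-13b", tensor_parallel_size=num_gpus)
    return collect_texts(llm.generate("San Francisco is a"))
```

Calling :code:`run_tensor_parallel_demo()` on a machine with 4 GPUs would return one :code:`(prompt, completion)` pair for the single prompt.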

To run multi-GPU serving, pass in the :code:`--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:

.. code-block:: console

    $ python -m vllm.entrypoints.api_server \
    $     --model facebook/opt-13b \
    $     --tensor-parallel-size 4
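
Once the server is up, a client does not need to know how many GPUs back it. The sketch below builds a request using only the standard library. It rests on assumptions not stated in this document: that the demo server listens on :code:`localhost:8000`, exposes :code:`POST /generate` taking a JSON body with :code:`prompt` plus sampling parameters such as :code:`max_tokens`, and replies with JSON containing a :code:`text` list. The function names are hypothetical.

```python
import json
import urllib.request
from typing import List


def build_generate_request(prompt: str,
                           host: str = "localhost",
                           port: int = 8000,
                           max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) a POST request for the API server.

    Assumption: the server exposes POST /generate and accepts a JSON body
    with "prompt" plus sampling parameters.
    """
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> List[str]:
    """Send the request; requires a running server.

    Assumption: the response is JSON with a "text" list of completions.
    """
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["text"]
```

Splitting request construction from sending keeps the payload easy to inspect before any network call is made.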

To scale vLLM beyond a single machine, start a `Ray runtime <https://docs.ray.io/en/latest/ray-core/starting-ray.html>`_ via CLI before running vLLM:

.. code-block:: console

    $ # On head node
    $ ray start --head

    $ # On worker nodes
    $ ray start --address=<ray-head-address>

After that, you can run inference and serving on multiple machines by launching the vLLM process on the head node and setting :code:`tensor_parallel_size` to the total number of GPUs across all machines.
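
As a worked example of the multi-machine setting, the sketch below adds up GPUs over a hypothetical two-node cluster (the node names and counts are made up for illustration). It also shows one way to cross-check the figure against a running Ray cluster with :code:`ray.cluster_resources()`. The Ray call is wrapped in a function and never invoked here, since it needs a live cluster.

```python
from typing import Dict


def total_tensor_parallel_size(gpus_per_node: Dict[str, int]) -> int:
    """Sum GPU counts over all nodes of the cluster."""
    if any(count < 0 for count in gpus_per_node.values()):
        raise ValueError("GPU counts must be non-negative")
    return sum(gpus_per_node.values())


def gpus_visible_to_ray() -> int:
    """Ask a running Ray cluster how many GPUs it has registered.

    Requires Ray and a cluster started with `ray start`; not called here.
    """
    import ray

    ray.init(address="auto")  # attach to the already-running cluster
    return int(ray.cluster_resources().get("GPU", 0))


# Hypothetical cluster: one head node and one worker, 4 GPUs each.
cluster = {"head": 4, "worker-0": 4}
tp_size = total_tensor_parallel_size(cluster)  # 4 + 4 = 8
```

With this count, the process launched on the head node would pass :code:`tensor_parallel_size=tp_size` to :code:`LLM`, or :code:`--tensor-parallel-size 8` to the API server.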