.. _quickstart:

==========
Quickstart
==========

This guide will help you quickly get started with vLLM to:

* :ref:`Run offline batched inference <offline_batched_inference>` 
* :ref:`Run OpenAI-compatible inference <openai_compatible_server>`

Prerequisites
--------------
- OS: Linux
- Python: 3.8 - 3.12
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
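
The OS and Python requirements above can be checked with a few lines of standard-library Python. This is only a sketch: the helper name ``check_prerequisites`` is made up for this example, and it does not check the GPU compute capability, which needs a CUDA-aware library.

.. code-block:: python

    import platform
    import sys

    # Supported range taken from the prerequisites list above (inclusive).
    MIN_PYTHON = (3, 8)
    MAX_PYTHON = (3, 12)


    def check_prerequisites(os_name, python_version):
        """Return a list of human-readable problems; an empty list means OK."""
        problems = []
        if os_name != "Linux":
            problems.append(f"unsupported OS: {os_name}")
        major_minor = tuple(python_version[:2])
        if not (MIN_PYTHON <= major_minor <= MAX_PYTHON):
            problems.append(
                f"unsupported Python: {major_minor[0]}.{major_minor[1]}")
        return problems


    # Check the interpreter and OS that are running this script.
    issues = check_prerequisites(platform.system(), sys.version_info)
    print("OK" if not issues else issues)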

Installation
--------------

You can install vLLM using pip. It's recommended to use `conda <https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html>`_ to create and manage Python environments.

.. code-block:: console

    $ conda create -n myenv python=3.10 -y
    $ conda activate myenv
    $ pip install vllm

Please refer to the :ref:`installation documentation <installation>` for more details on installing vLLM.

.. _offline_batched_inference:

Offline Batched Inference
-------------------------

With vLLM installed, you can start generating text for a list of input prompts (i.e., offline batch inference). The example script for this section can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`__.

The first line of this example imports the classes :class:`~vllm.LLM` and :class:`~vllm.SamplingParams`:

- :class:`~vllm.LLM` is the main class for running offline inference with the vLLM engine.
- :class:`~vllm.SamplingParams` specifies the parameters for the sampling process.

.. code-block:: python

    from vllm import LLM, SamplingParams

The next section defines a list of input prompts and sampling parameters for text generation. The `sampling temperature <https://arxiv.org/html/2402.05201v1>`_ is set to ``0.8`` and the `nucleus sampling probability <https://en.wikipedia.org/wiki/Top-p_sampling>`_ is set to ``0.95``. You can find more information about the sampling parameters `here <https://docs.vllm.ai/en/stable/dev/sampling_params.html>`__.

.. code-block:: python

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
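
To build some intuition for what ``temperature`` and ``top_p`` control, the following pure-Python sketch applies both to a toy distribution over four tokens. It is a simplified illustration and not vLLM's implementation; the function names and the logit values are invented for this example.

.. code-block:: python

    import math


    def apply_temperature(logits, temperature):
        """Scale logits by 1/temperature and return softmax probabilities."""
        scaled = [x / temperature for x in logits]
        peak = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]


    def top_p_filter(probs, top_p):
        """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        mass = sum(probs[i] for i in kept)
        # Renormalize the surviving tokens; everything else gets probability 0.
        return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


    logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
    probs = apply_temperature(logits, temperature=0.8)
    filtered = top_p_filter(probs, top_p=0.95)
    print([round(p, 3) for p in filtered])

A lower temperature concentrates probability on the most likely token, while a lower ``top_p`` removes more of the unlikely tail before a token is sampled.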

The :class:`~vllm.LLM` class initializes vLLM's engine and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_ for offline inference. The list of supported models can be found :ref:`here <supported_models>`.

.. code-block:: python

    llm = LLM(model="facebook/opt-125m")

.. note::

    By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_, set the environment variable ``VLLM_USE_MODELSCOPE`` before initializing the engine.
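
For the ModelScope option in the note above, the important detail is ordering: the variable has to be set before the engine is created. A minimal sketch, assuming that the value ``True`` is what enables the behavior:

.. code-block:: python

    import os

    # Assumption: "True" is the value that switches downloads to ModelScope.
    os.environ["VLLM_USE_MODELSCOPE"] = "True"

    # Create the engine only after the variable is set, for example:
    # llm = LLM(model="<a model ID that exists on ModelScope>")
    print(os.environ["VLLM_USE_MODELSCOPE"])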

Now, the fun part! The outputs are generated using ``llm.generate``. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the output tokens.

.. code-block:: python

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
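
The loop above reads only ``output.outputs[0]``. Because ``outputs`` is a list, a request can carry more than one generated candidate, for instance when several completions per prompt are requested through the ``n`` sampling parameter. The sketch below collects every candidate per prompt. It uses stand-in objects that mimic only the ``prompt`` and ``outputs[i].text`` attributes used above, so it runs without a GPU; the helper name and the sample texts are made up.

.. code-block:: python

    from types import SimpleNamespace


    def collect_generations(outputs):
        """Map each prompt to the list of texts generated for it."""
        return {
            request.prompt: [candidate.text for candidate in request.outputs]
            for request in outputs
        }


    # Stand-in for the list returned by llm.generate (hypothetical texts).
    fake_outputs = [
        SimpleNamespace(
            prompt="The capital of France is",
            outputs=[SimpleNamespace(text=" Paris."),
                     SimpleNamespace(text=" a city.")],
        ),
    ]
    for prompt, texts in collect_generations(fake_outputs).items():
        print(f"{prompt!r} -> {texts!r}")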

.. _openai_compatible_server:

OpenAI-Compatible Server
------------------------

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.
By default, the server starts at ``http://localhost:8000``. You can specify the address with the ``--host`` and ``--port`` arguments. The server currently hosts one model at a time and implements endpoints such as `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_.

Run the following command to start the vLLM server with the `Qwen2.5-1.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct>`_ model:

.. code-block:: console

    $ vllm serve Qwen/Qwen2.5-1.5B-Instruct

.. note::

    By default, the server uses a predefined chat template stored in the tokenizer. You can learn about overriding it `here <https://github.com/vllm-project/vllm/blob/main/docs/source/serving/openai_compatible_server.md#chat-template>`__.

This server can be queried in the same format as the OpenAI API. For example, to list the models:

.. code-block:: console

    $ curl http://localhost:8000/v1/models

You can pass in the argument ``--api-key`` or the environment variable ``VLLM_API_KEY`` to make the server check for an API key in the request header.
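
When an API key is enabled, clients have to send it with every request. OpenAI-style clients do this with an ``Authorization: Bearer <key>`` header. The sketch below builds such a request for the list-models endpoint using only the standard library, without sending it. The helper name and the key value are placeholders for this example.

.. code-block:: python

    import urllib.request


    def build_models_request(api_key, host="localhost", port=8000):
        """Build (but do not send) a GET request for the list-models endpoint."""
        url = f"http://{host}:{port}/v1/models"
        # OpenAI-style clients send the key as a bearer token.
        headers = {"Authorization": f"Bearer {api_key}"}
        return urllib.request.Request(url, headers=headers, method="GET")


    request = build_models_request("my-secret-key")  # hypothetical key
    print(request.full_url)
    print(request.get_header("Authorization"))
    # To send it against a running server: urllib.request.urlopen(request)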

OpenAI Completions API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once your server is started, you can query the model with input prompts:

.. code-block:: console

    $ curl http://localhost:8000/v1/completions \
    $     -H "Content-Type: application/json" \
    $     -d '{
    $         "model": "Qwen/Qwen2.5-1.5B-Instruct",
    $         "prompt": "San Francisco is a",
    $         "max_tokens": 7,
    $         "temperature": 0
    $     }'

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application using the OpenAI API. For example, another way to query the server is via the ``openai`` Python package:

.. code-block:: python

    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                          prompt="San Francisco is a")
    print("Completion result:", completion)

A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.
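
The completions endpoint answers with a JSON object in the OpenAI completions format, where the generated text sits under ``choices``. The following sketch extracts it from a response body. The body shown here is illustrative only: its ``id`` and generated text are made up rather than captured from a real server, and the helper name is invented for this example.

.. code-block:: python

    import json

    # Illustrative response body in the OpenAI completions format; the id and
    # generated text are made up, not real server output.
    raw_body = """
    {
      "id": "cmpl-example",
      "object": "text_completion",
      "model": "Qwen/Qwen2.5-1.5B-Instruct",
      "choices": [
        {"index": 0, "text": " city in California", "finish_reason": "length"}
      ]
    }
    """


    def first_completion_text(body):
        """Return the text of the first choice in a completions response."""
        payload = json.loads(body)
        return payload["choices"][0]["text"]


    print(repr(first_completion_text(raw_body)))

With the ``openai`` client shown earlier, the same field is available as ``completion.choices[0].text``.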

OpenAI Chat Completions API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vLLM also supports the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

.. code-block:: console

    $ curl http://localhost:8000/v1/chat/completions \
    $     -H "Content-Type: application/json" \
    $     -d '{
    $         "model": "Qwen/Qwen2.5-1.5B-Instruct",
    $         "messages": [
    $             {"role": "system", "content": "You are a helpful assistant."},
    $             {"role": "user", "content": "Who won the world series in 2020?"}
    $         ]
    $     }'

Alternatively, you can use the ``openai`` Python package:

.. code-block:: python

    from openai import OpenAI

    # Set OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    chat_response = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."},
        ]
    )
    print("Chat response:", chat_response)
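
The back-and-forth exchange described at the start of this section comes down to keeping a growing list of messages and sending the whole list with each request. The sketch below shows that bookkeeping. To stay runnable without a server, it uses a hypothetical ``echo_backend`` function in place of the real call; ``chat_turn`` is likewise a name invented for this example.

.. code-block:: python

    def chat_turn(history, user_message, send):
        """Append a user message, call ``send``, and record the reply."""
        history.append({"role": "user", "content": user_message})
        reply = send(history)
        history.append({"role": "assistant", "content": reply})
        return reply


    def echo_backend(messages):
        """Hypothetical stand-in for the server: echoes the last user message."""
        return f"You said: {messages[-1]['content']}"


    history = [{"role": "system", "content": "You are a helpful assistant."}]
    chat_turn(history, "Tell me a joke.", echo_backend)
    chat_turn(history, "Explain it.", echo_backend)
    print([message["role"] for message in history])

To talk to the running server instead, ``send`` can be a function that calls ``client.chat.completions.create`` with the accumulated ``messages`` and returns ``choices[0].message.content`` from the response.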