
# OpenAI Compatible Server

vLLM provides an HTTP server that implements OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.

You can start the server using Python, or using [Docker](deploying_with_docker.rst):

```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```

To call the server, you can use the official OpenAI Python client library, or any other HTTP client.

```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
```

## API Reference

We currently support the following OpenAI APIs:

- [Completions API](https://platform.openai.com/docs/api-reference/completions)
  - *Note: `suffix` parameter is not supported.*
- [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
  - [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Multimodal Inputs](../usage/multimodal_inputs.rst).
    - *Note: `image_url.detail` parameter is not supported.*
  - We also support the `audio_url` content type for audio files.
    - Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
    - *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
  - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
  - Instead of `inputs`, you can pass in a list of `messages` (same schema as the Chat Completions API), which will be treated as a single prompt to the model according to its chat template; see the example after this list.
  - This enables multi-modal inputs to be passed to embedding models; see [this page](../usage/multimodal_inputs.rst) for details.
  - *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*

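Since the official OpenAI Python client only sends `input` to the embeddings endpoint, the easiest way to try the `messages` form is a plain HTTP request. The snippet below is a minimal sketch: the model name is a placeholder for whatever embedding model you are serving with `--task embedding`, and it assumes that model has a chat template.

```python
# Minimal sketch: pass chat `messages` to the Embeddings API over plain HTTP.
# Assumptions: the server was started with `--task embedding`, and the served
# model (placeholder name below) has a chat template.
import requests

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "MY_EMBEDDING_MODEL",  # placeholder; substitute the model you are serving
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "encoding_format": "float",
    },
)
response.raise_for_status()
print(response.json()["data"][0]["embedding"][:8])
```
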
## Score API for Cross Encoder Models

vLLM supports *cross-encoder models* at the **/v1/score** endpoint, which is not an OpenAI API standard endpoint. You can find the documentation for these kinds of models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

A ***Cross Encoder*** takes exactly two sentences / texts as input and predicts a score or label for this sentence pair. It can, for example, predict the similarity of the sentence pair on a scale of 0 … 1.

### Example of usage for a pair of a string and a list of texts

In this case, the model will compare the first given text to each of the texts in the list.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

Response:

```bash
{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": [
        0.001094818115234375
      ]
    },
    {
      "index": 1,
      "object": "score",
      "score": [
        1
      ]
    }
  ],
  "usage": {}
}
```

### Example of usage for a pair of two lists of texts

In this case, the model will compare the texts one by one, pairing the entries at the same index in each list.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'
```

Response:

```bash
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": [
        1
      ]
    },
    {
      "index": 1,
      "object": "score",
      "score": [
        1
      ]
    }
  ],
  "usage": {}
}
```

### Example of usage for a pair of two strings

In this case, the model will compare the two strings directly.

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/v1/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'
```

Response:

```bash
{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": [
        1
      ]
    }
  ],
  "usage": {}
}
```

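This endpoint is not covered by the OpenAI Python client, so from Python you can call it with a plain HTTP request. The snippet below is a minimal sketch of the string-vs-list case above; it assumes the same `BAAI/bge-reranker-v2-m3` model is being served locally on port 8000.

```python
# Minimal sketch: call the /v1/score endpoint over plain HTTP.
# Assumes a cross-encoder model (BAAI/bge-reranker-v2-m3 here) is served locally.
import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "text_1": "What is the capital of France?",
        "text_2": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
response.raise_for_status()
for item in response.json()["data"]:
    # As in the responses above, each "score" field is a single-element list.
    print(item["index"], item["score"][0])
```
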
## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
In order to use them, you can pass them as extra parameters in the OpenAI client,
or merge them directly into the JSON payload if you are calling the HTTP API directly.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
```

### Extra HTTP Headers

Only the `X-Request-Id` HTTP request header is supported for now.

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)
```

### Extra Parameters for Completions API

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
```

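As with the Chat Completions example in the previous section, these parameters are passed through `extra_body` when using the OpenAI Python client. The snippet below is only a sketch: it reuses the `client` configured at the top of this page and assumes that `min_tokens` and `guided_regex` appear in the parameter lists above.

```python
# Sketch: pass vLLM-specific Completions parameters via extra_body.
# `client` is the OpenAI client configured earlier in this document; `min_tokens`
# (sampling parameter) and `guided_regex` (extra parameter) are assumed to be
# among the parameters listed above.
completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="The first three prime numbers are ",
    extra_body={
        "min_tokens": 4,
        "guided_regex": r"\d+, \d+, \d+",
    },
)
print(completion.choices[0].text)
```
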
### Extra Parameters for Chat Completions API

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params
```

### Extra Parameters for Embeddings API

The following [pooling parameters (click through to see documentation)](../dev/pooling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
```

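These can likewise be passed through `extra_body` when using the OpenAI Python client. A minimal sketch, assuming the server is running an embedding model (the model name below is a placeholder) and that `add_special_tokens` is among the extra parameters listed above:

```python
# Sketch: pass a vLLM-specific extra parameter to the Embeddings API via extra_body.
# `client` is the OpenAI client configured earlier in this document; the model name
# is a placeholder, and `add_special_tokens` is assumed to be listed above.
embedding = client.embeddings.create(
    model="MY_EMBEDDING_MODEL",
    input="What is the capital of France?",
    extra_body={
        "add_special_tokens": False,
    },
)
print(len(embedding.data[0].embedding))
```
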
## Chat Template

In order for the language model to support chat protocol, vLLM requires the model to include
a chat template in its tokenizer configuration. The chat template is a Jinja2 template that
specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for `NousResearch/Meta-Llama-3-8B-Instruct` can be found [here](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models).

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models,
you can manually specify their chat template in the `--chat-template` parameter with the file path to the chat
template, or the template in string form. Without a chat template, the server will not be able to process chat
messages and all chat requests will error.

```bash
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```

The vLLM community provides a set of chat templates for popular models. You can find them in the examples
directory [here](https://github.com/vllm-project/vllm/tree/main/examples/).

With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:

```python
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)
```

Most chat templates for LLMs expect the `content` field to be a string, but there are some newer models like
`meta-llama/Llama-Guard-3-1B` that expect the content to be formatted according to the OpenAI schema in the
request. vLLM provides best-effort support to detect this automatically, which is logged as a string like
*"Detected the chat template content format to be..."*, and internally converts incoming requests to match
the detected format, which can be one of:

- `"string"`: A string.
  - Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
  - Example: `[{"type": "text", "text": "Hello world!"}]`

If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.

## Command line arguments for the server

```{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
```

### Config file

The `serve` module can also accept arguments from a config file in
`yaml` format. The arguments in the YAML file must be specified using the
long form of the argument outlined [here](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server).

For example:

```yaml
# config.yaml

host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"
```

```bash
$ vllm serve SOME_MODEL --config config.yaml
```

---
**NOTE**
If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
The order of priorities is `command line > config file values > defaults`.