Replace "online inference" with "online serving" (#11923)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent ef725feafc
commit d85c47d6ad
@@ -61,7 +61,7 @@ function cpu_tests() {
 pytest -s -v -k cpu_model \
   tests/basic_correctness/test_chunked_prefill.py"

-# online inference
+# online serving
 docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
   set -e
   export VLLM_CPU_KVCACHE_SPACE=10
@@ -5,7 +5,7 @@
 vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding.
 This document shows you some examples of the different options that are available to generate structured outputs.

-## Online Inference (OpenAI API)
+## Online Serving (OpenAI API)

 You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.

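For context, the "Online Serving" path this hunk renames is the OpenAI-compatible API. A minimal sketch of a structured-output request against a locally running server, using the OpenAI Python client; the base URL, API key, and model name are placeholders, not values taken from this commit:

```python
from openai import OpenAI

# Assumes a server was started with `vllm serve <model>` on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder model
    messages=[{"role": "user",
               "content": "Classify this sentiment: vLLM is wonderful!"}],
    # vLLM-specific extension: constrain the output to one of the listed choices.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```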
@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are:
 - `backend`
 - `whitespace_pattern`

-These parameters can be used in the same way as the parameters from the Online Inference examples above.
+These parameters can be used in the same way as the parameters from the Online Serving examples above.
 One example for the usage of the `choices` parameter is shown below:

 ```python
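The offline counterpart referenced here routes guided decoding through `SamplingParams`. A sketch under the assumption that `GuidedDecodingParams` exposes a `choice` field and `SamplingParams` accepts a `guided_decoding` argument (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain generation to one of two choices.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
params = SamplingParams(guided_decoding=guided)

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")  # placeholder model
outputs = llm.generate("Classify this sentiment: vLLM is wonderful!", params)
print(outputs[0].outputs[0].text)
```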
@@ -83,7 +83,7 @@ $ python setup.py develop
 ## Supported Features

 - [Offline inference](#offline-inference)
-- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
+- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -385,5 +385,5 @@ the below:
 completely. With HPU Graphs disabled, you are trading latency and
 throughput at lower batches for potentially higher throughput on
 higher batches. You can do that by adding `--enforce-eager` flag to
-server (for online inference), or by passing `enforce_eager=True`
+server (for online serving), or by passing `enforce_eager=True`
 argument to LLM constructor (for offline inference).
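The two forms named in this paragraph are the `--enforce-eager` flag for online serving (`vllm serve <model> --enforce-eager`) and the `enforce_eager=True` constructor argument for offline inference. A minimal offline sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

# Offline inference with graph capture disabled (eager mode only).
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```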
@@ -5,7 +5,7 @@
 This guide will help you quickly get started with vLLM to perform:

 - [Offline batched inference](#quickstart-offline)
-- [Online inference using OpenAI-compatible server](#quickstart-online)
+- [Online serving using OpenAI-compatible server](#quickstart-online)

 ## Prerequisites

@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template)
|
|||||||
outputs = llm.chat(conversation, chat_template=custom_template)
|
outputs = llm.chat(conversation, chat_template=custom_template)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Online Inference
|
## Online Serving
|
||||||
|
|
||||||
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
|
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
|
||||||
|
|
||||||
|
@ -127,7 +127,7 @@ print(f"Score: {score}")
|
|||||||
|
|
||||||
A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
|
A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
|
||||||
|
|
||||||
## Online Inference
|
## Online Serving
|
||||||
|
|
||||||
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
|
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
|
||||||
|
|
||||||
|
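A rough sketch of the online counterpart of the offline scoring example, assuming the server exposes a score route whose fields mirror the offline `score(text_1, text_2)` API; the URL, route, and model are assumptions rather than details confirmed by this commit:

```python
import requests

# Assumes a vLLM server is running a cross-encoder/scoring model locally.
response = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",  # placeholder model
        "text_1": "What is the capital of France?",
        "text_2": "The capital of France is Paris.",
    },
)
print(response.json())
```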
@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod

 ````{important}
 To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
-or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
+or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:

 Offline inference:
 ```python
@@ -562,7 +562,7 @@ llm = LLM(
 )
 ```

-Online inference:
+Online serving:
 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```
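The offline half of this example is split across the previous two hunks; for reference, a sketch of what the constructor call presumably looks like, assuming the dictionary form of `limit_mm_per_prompt`:

```python
from vllm import LLM

# Allow up to 4 images per prompt
# (offline counterpart of `--limit-mm-per-prompt image=4`).
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    limit_mm_per_prompt={"image": 4},
)
```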
@@ -199,7 +199,7 @@ for o in outputs:
     print(generated_text)
 ```

-## Online Inference
+## Online Serving

 Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).

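A sketch of what sending multi-modal data to that Chat Completions endpoint can look like with the OpenAI client; the image URL and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The image is passed as an image_url content part alongside the text prompt.
chat = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
)
print(chat.choices[0].message.content)
```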
@@ -1,5 +1,5 @@
 """An example showing how to use vLLM to serve multimodal models
-and run online inference with OpenAI client.
+and run online serving with OpenAI client.

 Launch the vLLM server with the following command:

@ -309,7 +309,7 @@ def main(args) -> None:
|
|||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
parser = FlexibleArgumentParser(
|
parser = FlexibleArgumentParser(
|
||||||
description='Demo on using OpenAI client for online inference with '
|
description='Demo on using OpenAI client for online serving with '
|
||||||
'multimodal language models served with vLLM.')
|
'multimodal language models served with vLLM.')
|
||||||
parser.add_argument('--chat-type',
|
parser.add_argument('--chat-type',
|
||||||
'-c',
|
'-c',
|
||||||
|
@@ -237,8 +237,8 @@ def test_models_with_multiple_audios(vllm_runner, audio_assets, dtype: str,


 @pytest.mark.asyncio
-async def test_online_inference(client, audio_assets):
-    """Exercises online inference with/without chunked prefill enabled."""
+async def test_online_serving(client, audio_assets):
+    """Exercises online serving with/without chunked prefill enabled."""

     messages = [{
         "role":
@@ -1068,7 +1068,7 @@ def input_processor_for_molmo(ctx: InputContext, inputs: DecoderOnlyInputs):
         trust_remote_code=model_config.trust_remote_code)

     # NOTE: message formatting for raw text prompt is only applied for
-    # offline inference; for online inference, the prompt is always in
+    # offline inference; for online serving, the prompt is always in
     # instruction format and tokenized.
     if prompt is not None and re.match(r"^User:[\s\S]*?(Assistant:)*$",
                                        prompt):