[CI/Build] Auto-fix Markdown files (#12941)
This commit is contained in:
parent
4c8dd12ef3
commit
8a69e0e20e
@ -1,15 +1,13 @@
# vLLM benchmark suite

## Introduction

This directory contains two sets of benchmarks for vLLM:

- Performance benchmark: benchmarks vLLM's performance under various workloads, for **developers** to gain clarity on whether their PR improves or degrades vLLM's performance.
- Nightly benchmark: compares vLLM's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vLLM.

See the [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and the [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.

## Performance benchmark quick overview
@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hour so that it won't take forever to run.

## Nightly benchmark quick overview

**Benchmarking Coverage**: fixed-QPS serving on A100 (support for FP8 benchmarks on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.

**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.

**Benchmarking Duration**: about 3.5 hours.

## Trigger the benchmark

Performance benchmark will be triggered when:
@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
Nightly benchmark will be triggered when:

- Every commit for PRs with both the `perf-benchmarks` and `nightly-benchmarks` labels.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.

### Latency test

Here is an example of one test inside `latency-tests.json`:
@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
```

In this example:

- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments to be used for `benchmark_latency.py`. Please use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15` (see the sketch after this list).
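The following minimal Python sketch is an illustration only: the `latency_llama8B_tp1` entry is assumed, and this is not the actual `run-performance-benchmarks.sh` logic. It shows how the underscore-to-dash convention maps a test's `parameters` to the command line arguments listed above.

```python
# Hypothetical latency test entry, mirroring the CLI arguments listed above.
test = {
    "test_name": "latency_llama8B_tp1",  # assumed name, for illustration only
    "parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "tensor_parallel_size": 1,
        "load_format": "dummy",
        "num_iters_warmup": 5,
        "num_iters": 15,
    },
}

# Underscores in parameter names become dashes on the command line;
# values are passed through unchanged.
args = []
for key, value in test["parameters"].items():
    args.append("--" + key.replace("_", "-"))
    args.append(str(value))

print("benchmark_latency.py " + " ".join(args))
# benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1
# --load-format dummy --num-iters-warmup 5 --num-iters 15
```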
Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script will save json results by itself, so please do not configure the `--output-json` parameter in the json file.

### Throughput test

The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be fed forward to `benchmark_throughput.py`.

The numbers from this test are also stable; however, a slight change in the parameter values can vary the performance numbers by a lot.

### Serving test

We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
    {
        "test_name": "serving_llama8B_tp1_sharegpt",
@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
```

Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` attribute includes the command line arguments for the vLLM server.
- The `client-parameters` attribute includes the command line arguments for `benchmark_serving.py`.
@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma
WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.

### Visualizing the results

The `convert-results-json-to-markdown.py` script helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking job.
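As a rough, hedged sketch of what that conversion step amounts to (this is not the actual `convert-results-json-to-markdown.py` implementation, and the file name below is assumed), a results json can be rendered as a markdown table with pandas:

```python
# Hedged sketch: load a list of per-test result records and print them as a
# markdown table. The file name is an assumption for illustration.
import json

import pandas as pd

with open("benchmark_results.json") as f:
    records = json.load(f)  # e.g. [{"Test name": ..., "Mean latency (ms)": ...}, ...]

table = pd.DataFrame(records)
print(table.to_markdown(index=False))  # requires the `tabulate` package
```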
## Nightly test details

See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workloads, models and docker containers used for benchmarking other LLM engines.

### Workflow

- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for the different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
- The `run-nightly-suite.sh` script then redirects the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to buildkite.

### Nightly tests

In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for the benchmarking commands, together with the benchmarking test cases. The format is highly similar to that of the performance benchmark.

### Docker containers

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.

WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.

WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
@ -9,20 +9,19 @@ This file contains the downloading link for benchmarking results.
Please download the visualization scripts in the post.

## Results reproduction

- Find the docker image we use in the `benchmarking pipeline`
- Deploy the docker image, and inside the docker container:
  - Download `nightly-benchmarks.zip`.
  - In the same folder, run the following code:

```console
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.
@ -2,6 +2,7 @@
# Nightly benchmark

This benchmark aims to:

- Provide performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy or SGLang) leads in performance for which workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker image by following the reproduction instructions.
@ -9,7 +10,6 @@ Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html)
Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)

## Setup

- Docker images:
@ -33,7 +33,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
- Queries are randomly sampled, and arrival patterns are determined via a Poisson process, but all with a fixed random seed.
- Evaluation metrics: Throughput (higher is better), TTFT (time to the first token, lower is better), ITL (inter-token latency, lower is better).

## Known issues

- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support the `ignore-eos` flag.
@ -7,10 +7,8 @@
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).

{latency_tests_markdown_table}

## Throughput tests

- Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
@ -19,10 +17,8 @@
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.

{throughput_tests_markdown_table}

## Serving tests

- Input length: randomly sample 200 prompts from the ShareGPT dataset (with a fixed random seed).
@ -33,13 +29,11 @@
- We also added a speculative decoding test for llama-3 70B, under QPS 2
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).

{serving_tests_markdown_table}

## json version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format.
You can load the benchmarking tables into pandas dataframes as follows:

```python
@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The json string for all benchmarking tables:

```json
{benchmarking_results_in_json_string}
```

You can also check the raw experiment data in the Artifact tab of the Buildkite page.
.github/PULL_REQUEST_TEMPLATE.md
@ -2,4 +2,5 @@ FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (*link existing issues this PR will resolve*)

<!--- pyml disable-next-line no-emphasis-as-heading -->
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>**
@ -33,7 +33,7 @@ repos:
  rev: v0.9.27
  hooks:
  - id: pymarkdown
    args: [fix]
- repo: https://github.com/rhysd/actionlint
  rev: v1.7.7
  hooks:
@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by
For answers to common questions about this code of conduct, see the
[Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
[Contributor Covenant translations](https://www.contributor-covenant.org/translations).
README.md
@ -16,6 +16,7 @@ Easy, fast, and cheap LLM serving for everyone
---

*Latest News* 🔥

- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
@ -33,7 +34,9 @@ Easy, fast, and cheap LLM serving for everyone
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
@ -127,6 +130,7 @@ We also have an official fundraising venue through [OpenCollective](https://open
## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
@ -138,11 +142,11 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
## Contact Us

- For technical questions and feature requests, please use Github issues or discussions.
- For discussing with fellow users and coordinating contributions and development, please use Slack.
- For security disclosures, please use Github's security advisory feature.
- For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).
@ -3,6 +3,7 @@
## Downloading the ShareGPT dataset

You can download the dataset by running:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
@ -11,6 +12,7 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r
The json file refers to several image datasets (coco, llava, etc.). The benchmark scripts
will ignore a datapoint if the referred image is missing.

```bash
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
mkdir coco -p
@ -1,17 +1,19 @@
# CUTLASS Epilogues

## Introduction

This document describes the various CUTLASS epilogues implemented for fusing de-quantization operations onto GEMMs.

Currently, we only support symmetric quantization for weights,
and symmetric and asymmetric quantization for activations.
Both can be quantized per-tensor or per-channel (weights) / per-token (activations).

There are 4 epilogues:

1. `ScaledEpilogue`: symmetric quantization for activations, no bias.
1. `ScaledEpilogueBias`: symmetric quantization for activations, supports bias.
1. `ScaledEpilogueAzp`: asymmetric per-tensor quantization for activations, supports bias.
1. `ScaledEpilogueAzpPerToken`: asymmetric per-token quantization for activations, supports bias.

We do not have epilogues for asymmetric quantization of activations without bias in order to reduce final binary size.
Instead, if no bias is passed, the epilogue will use 0 as the bias.
@ -26,12 +28,15 @@ If $` \widehat X `$ is the quantized $` X `$, our matrices become the following
```math
A = s_a (\widehat A - J_a z_a)
```

```math
B = s_b \widehat B
```

```math
D = A B + C
```

```math
D = s_a s_b \widehat D + C
```
@ -48,9 +53,11 @@ Expanding further, we can calculate $` \widehat D `$ as follows:
```math
A B = s_a ( \widehat A - J_a z_a ) s_b \widehat B
```

```math
A B = s_a s_b \left( \widehat A \widehat B - J_a z_a \widehat B \right)
```

```math
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
```
@ -61,16 +68,19 @@ Each row of it is equal to $` \mathbf 1 \widehat B `$, which is a row-vector of
## Epilogues

### `ScaledEpilogue`

This epilogue computes the symmetric quantization for activations without bias, meaning $` C = 0 `$ and $` z_a = 0 `$.
The output of the GEMM is:

```math
\widehat D = \widehat A \widehat B
```

```math
D = s_a s_b \widehat D
```

```math
D = s_a s_b \widehat A \widehat B
```
@ -79,44 +89,51 @@ Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).

### `ScaledEpilogueBias`

This epilogue computes the symmetric quantization for activations with bias, meaning $` z_a = 0 `$.
The output of the GEMM is:

```math
\widehat D = \widehat A \widehat B
```

```math
D = s_a s_b \widehat D + C
```

```math
D = s_a s_b \widehat A \widehat B + C
```

Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
- `bias` is the bias, and is always per-channel (row-vector).

### `ScaledEpilogueAzp`

This epilogue computes the asymmetric per-tensor quantization for activations with bias.
The output of the GEMM is:

```math
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
```

```math
D = s_a s_b \widehat D + C
```

```math
D = s_a s_b \left( \widehat A \widehat B - z_a J_a \widehat B \right) + C
```

Because $` z_a `$ is a scalar, the zero-point term $` z_a J_a \widehat B `$ has every row equal to $` z_a \mathbf 1 \widehat B `$.
That is precomputed and stored in `azp_with_adj` as a row-vector.

Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
  - Generally this will be per-tensor as the zero-points are per-tensor.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
@ -125,13 +142,15 @@ Epilogue parameters:
To use these kernels efficiently, users must precompute the `azp_with_adj` term offline and pass it to the kernel.
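As a hedged numerical illustration of the formulas above (a NumPy sketch only, not the CUTLASS kernels; shapes and names are assumptions), the precomputed `azp_with_adj` row-vector can be checked against a plain dequantize-then-GEMM reference:

```python
# Hedged NumPy sketch of the ScaledEpilogueAzp math above (not the CUTLASS code).
import numpy as np

M, K, N = 4, 8, 3
rng = np.random.default_rng(0)

A_hat = rng.integers(0, 255, size=(M, K)).astype(np.int32)    # asymmetric quantized activations
B_hat = rng.integers(-128, 127, size=(K, N)).astype(np.int32)  # symmetric quantized weights
z_a = 7                 # per-tensor activation zero-point
s_a, s_b = 0.02, 0.01   # per-tensor scales
C = rng.standard_normal((1, N))  # per-channel bias (broadcast over rows)

# Offline precompute: azp_with_adj = z_a * (1-vector @ B_hat), a row-vector.
azp_with_adj = z_a * np.ones((1, K), dtype=np.int32) @ B_hat

# Epilogue: D = s_a * s_b * (A_hat @ B_hat - azp_with_adj) + C
D_epilogue = s_a * s_b * (A_hat @ B_hat - azp_with_adj) + C

# Reference: dequantize first, then GEMM with bias.
A = s_a * (A_hat - z_a)
B = s_b * B_hat
D_reference = A @ B + C

assert np.allclose(D_epilogue, D_reference)
```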
### `ScaledEpilogueAzpPerToken`

This epilogue computes the asymmetric per-token quantization for activations with bias.

The output of the GEMM is the same as above, but the $` z_a `$ is a column-vector.
That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product of $` z_a `$ and $` \mathbf 1 \widehat B `$.

Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
  - Generally this will be per-token as the zero-points are per-token.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
@ -142,6 +161,7 @@ Epilogue parameters:
To use these kernels efficiently, users must precompute the `azp_adj` term offline and pass it to the kernel.

The epilogue performs the following computation (where `Dq` is the raw quantized output of the GEMM):

```math
out = scale_a * scale_b * (Dq - azp_adj * azp) + bias
```
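For intuition, here is a similar hedged NumPy sketch of the per-token formula (again not the CUTLASS code; shapes are assumptions). Because `azp` is a per-token column-vector and `azp_adj` a row-vector, `azp_adj * azp` broadcasts to the outer-product correction described above:

```python
# Hedged NumPy sketch of the per-token epilogue formula above (not the CUTLASS code).
import numpy as np

M, K, N = 4, 8, 3
rng = np.random.default_rng(1)

A_hat = rng.integers(0, 255, size=(M, K)).astype(np.int32)
B_hat = rng.integers(-128, 127, size=(K, N)).astype(np.int32)
azp = rng.integers(0, 16, size=(M, 1)).astype(np.int32)  # per-token zero-points (column-vector)
scale_a = rng.random((M, 1))                              # per-token activation scales
scale_b = 0.01                                            # per-tensor weight scale
bias = rng.standard_normal((1, N))                        # per-channel bias

Dq = A_hat @ B_hat                                 # raw quantized GEMM output
azp_adj = np.ones((1, K), dtype=np.int32) @ B_hat  # precomputed row-vector (1 @ B_hat)

out = scale_a * scale_b * (Dq - azp_adj * azp) + bias  # azp_adj * azp broadcasts to an MxN correction

# Reference: dequantize activations per token, then GEMM with bias.
reference = (scale_a * (A_hat - azp)) @ (scale_b * B_hat) + bias
assert np.allclose(out, reference)
```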
@ -6,25 +6,25 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a
Machete effectively performs

```python
scale_type = w_s.dtype
compute_type = a.dtype
out = (w_q.to(scale_type) * w_s - w_z.to(scale_type)) @ a
```

Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and
`w_z` is the quantization zeropoints.

> **_NOTE:_** `w_z` is added after the scales so we can
use FMA operations, but this means they must have the scales pre-applied if the
supplied zeropoints assume that they will be subtracted before the scales are
applied.

## API

The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like:

```python
from vllm import _custom_ops as ops

...
@ -40,6 +40,6 @@ output = ops.machete_gemm(
## Code Generation

Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`.

New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, i.e., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate.
@ -93,12 +93,11 @@ Currently, there are no pre-built ROCm wheels.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

<!--- pyml disable-num-lines 5 ul-indent-->
:::{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm-up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
:::

:::{tip}
@ -4,7 +4,7 @@
Below, you can find an explanation of every engine argument for vLLM:

<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst}
.. argparse::
    :module: vllm.engine.arg_utils
@ -17,7 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM:
Below are the additional arguments related to the asynchronous engine:

<!--- pyml disable-num-lines 7 no-space-in-emphasis -->
```{eval-rst}
.. argparse::
    :module: vllm.engine.arg_utils
@ -5,50 +5,49 @@ This is a guide to performing batch inference using the OpenAI batch file format
```

## File Format

The OpenAI batch file format consists of a series of json objects on new lines.

[See here for an example file.](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/openai/openai_example_batch.jsonl)

Each line represents a separate request. See the [OpenAI package reference](https://platform.openai.com/docs/api-reference/batch/requestInput) for more details.
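As a hedged illustration of that one-JSON-object-per-line format (a sketch only; the request fields mirror the example batch shown later in this guide), such a file can be generated with a few lines of Python:

```python
# Sketch: write a two-request OpenAI-style batch file, one JSON object per line.
import json

requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Hello world!"},
            ],
            "max_completion_tokens": 1000,
        },
    }
    for i, system_prompt in enumerate(
        ["You are a helpful assistant.", "You are an unhelpful assistant."], start=1
    )
]

with open("openai_example_batch.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")
```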
```{note}
We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` endpoints (completions coming soon).
```

## Pre-requisites

* The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`.
  - Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens)
  - Install the token on your machine (Run `huggingface-cli login`).
  - Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.

## Example 1: Running with a local file

### Step 1: Create your batch file

To follow along with this example, you can download the example batch, or create your own batch file in your working directory.

```console
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl
```

Once you've created your batch file, it should look like this:

```console
$ cat offline_inference/openai/openai_example_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
```

### Step 2: Run the batch

The batch running tool is designed to be used from the command line.

You can run the batch with the following command, which will write its results to a file called `results.jsonl`:

```console
python -m vllm.entrypoints.openai.run_batch -i offline_inference/openai/openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
```
@ -56,7 +55,7 @@ python -m vllm.entrypoints.openai.run_batch -i offline_inference/openai/openai_e
You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`:

```console
$ cat results.jsonl
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null}
@ -68,7 +67,7 @@ The batch runner supports remote input and output urls that are accessible via h
For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl`, you can run:

```console
python -m vllm.entrypoints.openai.run_batch -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
```
@ -80,7 +79,7 @@ To integrate with cloud blob storage, we recommend using presigned urls.
### Additional prerequisites

* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
* The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3.
  - [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
* The `boto3` python package (Run `pip install boto3`) to generate presigned urls.

@ -89,13 +88,13 @@ To integrate with cloud blob storage, we recommend using presigned urls.

To follow along with this example, you can download the example batch, or create your own batch file in your working directory.

```console
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl
```

Once you've created your batch file, it should look like this:

```console
$ cat offline_inference/openai/openai_example_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
```
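
If you would rather generate the batch file programmatically, a short sketch like this produces the same layout (the model name and prompts are simply the ones used above):

```python
import json

# Write a small chat-completion batch file in the format shown above.
requests = [
    ("request-1", "You are a helpful assistant."),
    ("request-2", "You are an unhelpful assistant."),
]
with open("openai_example_batch.jsonl", "w") as f:
    for custom_id, system_prompt in requests:
        entry = {
            "custom_id": custom_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": "Hello world!"},
                ],
                "max_completion_tokens": 1000,
            },
        }
        f.write(json.dumps(entry) + "\n")
```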

@ -103,7 +102,7 @@ $ cat offline_inference/openai/openai_example_batch.jsonl

Now upload your batch file to your S3 bucket.

```console
aws s3 cp offline_inference/openai/openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl
```

@ -111,9 +110,9 @@ aws s3 cp offline_inference/openai/openai_example_batch.jsonl s3://MY_BUCKET/MY_

Presigned urls can only be generated via the SDK. You can run the following python script to generate your presigned urls. Be sure to replace the `MY_BUCKET`, `MY_INPUT_FILE.jsonl`, and `MY_OUTPUT_FILE.jsonl` placeholders with your bucket and file names.

(The script is adapted from <https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/s3/s3_basics/presigned_url.py>)

```python
import boto3
from botocore.exceptions import ClientError

@ -149,7 +148,7 @@ print(f"{output_url=}")

This script should output

```text
input_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091'
output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091'
```
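
The middle of the script is collapsed by the diff above; for orientation, its core is a pair of `generate_presigned_url` calls along these lines (a sketch only, using the placeholder bucket and key names from this walkthrough and an arbitrary expiry):

```python
import boto3
from botocore.exceptions import ClientError

# Sketch of the presigned-url generation performed by the adapted AWS example.
s3_client = boto3.client("s3")
try:
    input_url = s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": "MY_BUCKET", "Key": "MY_INPUT_FILE.jsonl"},
        ExpiresIn=3600,  # seconds
    )
    output_url = s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": "MY_BUCKET", "Key": "MY_OUTPUT_FILE.jsonl"},
        ExpiresIn=3600,
    )
except ClientError as err:
    raise SystemExit(f"Could not generate presigned urls: {err}")

print(f"{input_url=}")
print(f"{output_url=}")
```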

@ -158,7 +157,7 @@ output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AW

You can now run the batch runner, using the urls generated in the previous section.

```console
python -m vllm.entrypoints.openai.run_batch \
    -i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
    -o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \

@ -169,7 +168,7 @@ python -m vllm.entrypoints.openai.run_batch \

Your results are now on S3. You can view them in your terminal by running

```console
aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -
```
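
If you prefer to pull the results into Python rather than pipe them through the AWS CLI, a boto3 sketch like this (same placeholder bucket and key as above) does the job:

```python
import json

import boto3

# Download the results object and print each custom_id with its response.
s3_client = boto3.client("s3")
obj = s3_client.get_object(Bucket="MY_BUCKET", Key="MY_OUTPUT_FILE.jsonl")
for raw in obj["Body"].read().decode("utf-8").splitlines():
    if raw.strip():
        result = json.loads(raw)
        print(result["custom_id"], "->", result["response"])
```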

@ -180,10 +179,10 @@ aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -

* Ensure you are using `vllm >= 0.5.5`.

### Step 1: Create your batch file

Add embedding requests to your batch file. The following is an example:

```text
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}
```

@ -198,7 +197,7 @@ You can run the batch using the same command as in earlier examples.

You can check your results by running `cat results.jsonl`

```console
$ cat results.jsonl
{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
...
```
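
Each embedding comes back under `response.body.data[*].embedding`. A small standard-library sketch for pulling the vectors out of `results.jsonl` and comparing the two example requests:

```python
import json
import math

# Collect embeddings keyed by custom_id, then compute their cosine similarity.
embeddings = {}
with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        body = result["response"]["body"]
        embeddings[result["custom_id"]] = body["data"][0]["embedding"]

a, b = embeddings["request-1"], embeddings["request-2"]
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
print(f"cosine similarity: {dot / norm:.4f}")
```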

@ -211,10 +210,10 @@ $ cat results.jsonl

* Ensure you are using `vllm >= 0.7.0`.

### Step 1: Create your batch file

Add score requests to your batch file. The following is an example:

```text
{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
```

@ -229,7 +228,7 @@ You can run the batch using the same command as in earlier examples.

You can check your results by running `cat results.jsonl`

```console
$ cat results.jsonl
{"id":"vllm-f87c5c4539184f618e555744a2965987","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-806ab64512e44071b37d3f7ccd291413","body":{"id":"score-4ee45236897b4d29907d49b01298cdb1","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.0010900497436523438},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
{"id":"vllm-41990c51a26d4fac8419077f12871099","custom_id":"request-2","response":{"status_code":200,"request_id":"vllm-batch-73ce66379026482699f81974e14e1e99","body":{"id":"score-13f2ffe6ba40460fbf9f7f00ad667d75","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.001094818115234375},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
```
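
The scores land under `response.body.data[*].score`, in the same order as the `text_2` candidates. A minimal sketch to tabulate them per request:

```python
import json

# Print the score of each candidate passage for every request in results.jsonl.
with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        body = result["response"]["body"]
        scores = [item["score"] for item in body["data"]]
        print(result["custom_id"], scores)
```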

@ -29,7 +29,6 @@ python3 profiling.py \
    --profile-result-dir profiles
```

### Generate Decode Trace

This example runs Llama 3.1 70B with a batch of 32 requests where each has 1 input token and 128 output tokens. This is set up in an attempt to profile just the 32 decodes running in parallel by having an extremely small prefill of 1 token and setting `VLLM_TPU_PROFILE_DELAY_MS=1000` to skip the first second of inference (hopefully prefill).

@ -51,17 +50,18 @@ python3 profiling.py \
    --max-model-len 2048 --tensor-parallel-size 8
```

## Visualizing the profiles

Once you have collected your profiles with this script, you can visualize them using [TensorBoard](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm).

These are most likely the dependencies you will need to install:

```bash
pip install tensorflow-cpu tensorboard-plugin-profile etils importlib_resources
```

Then you just need to point TensorBoard to the directory where you saved the profiles and visit `http://localhost:6006/` in your browser:

```bash
tensorboard --logdir profiles/ --port 6006
```

@ -18,4 +18,4 @@ This directory contains a Helm chart for deploying the vllm application. The cha

- templates/poddisruptionbudget.yaml: Template for Pod Disruption Budget.
- templates/pvc.yaml: Template for Persistent Volume Claims.
- templates/secrets.yaml: Template for Kubernetes Secrets.
- templates/service.yaml: Template for creating Services.

@ -1,7 +1,8 @@

# Setup OpenTelemetry POC

1. Install OpenTelemetry packages:

```console
pip install \
'opentelemetry-sdk>=1.26.0,<1.27.0' \
'opentelemetry-api>=1.26.0,<1.27.0' \

@ -10,7 +11,8 @@
```

1. Start Jaeger in a docker container:

```console
# From: https://www.jaegertracing.io/docs/1.57/getting-started/
docker run --rm --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \

@ -28,19 +30,23 @@
```

1. In a new shell, export Jaeger IP:

```console
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
```

Then set vLLM's service name for OpenTelemetry, enable insecure connections to Jaeger and run vLLM:

```console
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

1. In a new shell, send requests with trace context from a dummy client

```console
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true

@ -48,7 +54,7 @@
python dummy_client.py
```
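
The `dummy_client.py` script itself is not reproduced here. As a rough illustration of what such a client does (not the actual script), the sketch below starts a span with the OpenTelemetry SDK, injects the trace context into the request headers, and calls the completions endpoint of the vLLM server started above:

```python
import requests
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export this client's spans to the same collector vLLM reports to
# (endpoint and insecure mode are picked up from the variables exported above).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("dummy-client")

with tracer.start_as_current_span("client-request"):
    headers = {}
    inject(headers)  # propagate the current trace context to vLLM
    response = requests.post(
        "http://localhost:8000/v1/completions",
        headers=headers,
        json={"model": "facebook/opt-125m", "prompt": "Hello world!", "max_tokens": 16},
    )
    print(response.json())

provider.shutdown()  # flush spans before the script exits
```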

1. Open Jaeger webui: <http://localhost:16686/>

In the search pane, select `vllm-server` service and hit `Find Traces`. You should get a list of traces, one for each request.

@ -57,26 +63,32 @@

## Exporter Protocol

OpenTelemetry supports either `grpc` or `http/protobuf` as the transport protocol for trace data in the exporter.
By default, `grpc` is used. To set `http/protobuf` as the protocol, configure the `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` environment variable as follows:

```console
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

## Instrumentation of FastAPI

OpenTelemetry allows automatic instrumentation of FastAPI.

1. Install the instrumentation library

```console
pip install opentelemetry-instrumentation-fastapi
```

1. Run vLLM with `opentelemetry-instrument`

```console
opentelemetry-instrument vllm serve facebook/opt-125m
```

1. Send a request to vLLM and find its trace in Jaeger. It should contain spans from FastAPI.

@ -1,14 +1,16 @@

# Prometheus and Grafana

This is a simple example that shows you how to connect vLLM metric logging to the Prometheus/Grafana stack. For this example, we launch Prometheus and Grafana via Docker. You can check out other methods through the [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/) websites.

Install:

- [`docker`](https://docs.docker.com/engine/install/)
- [`docker compose`](https://docs.docker.com/compose/install/linux/#install-using-the-repository)

## Launch

Prometheus metric logging is enabled by default in the OpenAI-compatible server. Launch via the entrypoint:

```bash
vllm serve mistralai/Mistral-7B-v0.1 \
    --max-model-len 2048 \

@ -16,11 +18,13 @@ vllm serve mistralai/Mistral-7B-v0.1 \
```

Launch Prometheus and Grafana servers with `docker compose`:

```bash
docker compose up
```

Submit some sample requests to the server:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

@ -41,13 +45,13 @@ Navigate to [`http://localhost:3000`](http://localhost:3000). Log in with the de

### Add Prometheus Data Source

Navigate to [`http://localhost:3000/connections/datasources/new`](http://localhost:3000/connections/datasources/new) and select Prometheus.

On the Prometheus configuration page, we need to add the `Prometheus Server URL` in `Connection`. For this setup, Grafana and Prometheus are running in separate containers, but Docker creates a DNS name for each container. You can just use `http://prometheus:9090`.

Click `Save & Test`. You should get a green check saying "Successfully queried the Prometheus API.".
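
Before importing the dashboard, you can also sanity-check from a script that Prometheus is actually scraping vLLM by hitting its HTTP API. A quick sketch; the metric name below is an assumption about vLLM's exported metric names, so adjust it to whatever your server's `/metrics` endpoint shows:

```python
import json
import urllib.parse
import urllib.request

# Query the Prometheus HTTP API for a vLLM metric.
# "vllm:num_requests_running" is an assumed name; check your /metrics output.
params = urllib.parse.urlencode({"query": "vllm:num_requests_running"})
with urllib.request.urlopen(f"http://localhost:9090/api/v1/query?{params}") as resp:
    data = json.load(resp)

for sample in data["data"]["result"]:
    print(sample["metric"], sample["value"])
```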

### Import Dashboard

Navigate to [`http://localhost:3000/dashboard/import`](http://localhost:3000/dashboard/import), upload `grafana.json`, and select the `prometheus` datasource. You should see a screen that looks like the following:

@ -15,7 +15,6 @@ more-complex-and-more-flexible.

- Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and
  set `VLLM_LOGGING_CONFIG_PATH=<path-to-logging-config.json>`

## Logging Configuration Environment Variables

### `VLLM_CONFIGURE_LOGGING`

@ -45,7 +44,6 @@ schema](https://docs.python.org/3/library/logging.config.html#dictionary-schema-

If `VLLM_LOGGING_CONFIG_PATH` is specified, but `VLLM_CONFIGURE_LOGGING` is
disabled, an error will occur while starting vLLM.

## Examples

### Example 1: Customize vLLM root logger

@ -98,7 +96,6 @@ VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
    vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```

### Example 2: Silence a particular vLLM logger

To silence a particular vLLM logger, it is necessary to provide custom logging

@ -153,7 +150,6 @@ VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
    vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```
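
The JSON passed via `VLLM_LOGGING_CONFIG_PATH` in Example 2 is collapsed by the diff above. As a rough sketch of the shape such a file takes (the logger name below is purely illustrative, not taken from the example), silencing one logger comes down to giving it no handlers and disabling propagation:

```python
import json

# Sketch of a logging config that silences a single logger.
# "vllm.example_noisy_logger" is a hypothetical name for illustration only.
logging_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "loggers": {
        "vllm.example_noisy_logger": {
            "handlers": [],
            "level": "CRITICAL",
            "propagate": False,
        }
    },
}

with open("/path/to/logging_config.json", "w") as f:
    json.dump(logging_config, f, indent=2)
```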

### Example 3: Disable vLLM default logging configuration

To disable vLLM's default logging configuration and silence all vLLM loggers,

@ -166,7 +162,6 @@ VLLM_CONFIGURE_LOGGING=0 \
    vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```

## Additional resources

- [`logging.config` Dictionary Schema Details](https://docs.python.org/3/library/logging.config.html#dictionary-schema-details)

@ -14,8 +14,8 @@ The KV cache transfer contains three layer of abstractions:

Why we need a KV lookup buffer: a FIFO pipe by itself is not enough, because the prefill vLLM worker may process requests in a different order than the decode vLLM worker. Say the QPS is really high: the prefill worker may handle requests in the order A -> B -> C, but the decode worker may need request C first. A FIFO pipe cannot naturally handle this, so we provide the KV lookup buffer to turn a FIFO pipe into a lookup buffer.
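
To make the idea concrete, here is a toy sketch of what "turning a FIFO pipe into a lookup buffer" means: a background thread drains the pipe into a dictionary keyed by request id, and lookups block until the entry they need arrives. This only illustrates the concept; it is not vLLM's actual lookup-buffer interface.

```python
import queue
import threading


class ToyLookupBuffer:
    """Toy illustration: drain a FIFO pipe into a dict keyed by request id."""

    def __init__(self, pipe):
        # `pipe` stands in for the KV pipe; the producer (prefill side) pushes
        # (request_id, kv_payload) tuples in whatever order it finishes them.
        self._pipe = pipe
        self._items = {}
        self._cond = threading.Condition()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            request_id, payload = self._pipe.get()
            with self._cond:
                self._items[request_id] = payload
                self._cond.notify_all()

    def pop(self, request_id, timeout=None):
        # The consumer (decode side) asks for a specific request id,
        # regardless of the order in which items arrived on the pipe.
        with self._cond:
            if not self._cond.wait_for(lambda: request_id in self._items, timeout):
                raise TimeoutError(f"KV for {request_id} never arrived")
            return self._items.pop(request_id)


# The decode side can ask for request C even though A and B arrived first.
pipe = queue.Queue()
buf = ToyLookupBuffer(pipe)
for rid in ("A", "B", "C"):
    pipe.put((rid, f"kv-for-{rid}"))
print(buf.pop("C", timeout=1.0))
```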

NOTE: The KV pipe layer is bypassable: you can skip this layer if your distributed
communication service already supports key-value-based lookup (like redis or an
RDMA database).

NOTE: If you want to not only transfer KV caches but also adjust vLLM's model execution flow (for example, allow vLLM to receive KV caches for some tokens and do prefill on the remaining tokens), you can bypass both the KV pipe layer and the KV lookup buffer layer and implement directly on the KV connector layer. Bear in mind that as vLLM's model input is constantly changing, such an implementation will likely break when vLLM has new updates.

@ -27,4 +27,3 @@ The example usage is in [this file](../../../examples/online_serving/disaggregat

Here is the diagram of how we run disaggregated prefilling.