[CI/Build] Auto-fix Markdown files (#12941)
This commit is contained in: parent 4c8dd12ef3 · commit 8a69e0e20e

@@ -1,16 +1,14 @@

# vLLM benchmark suite

## Introduction

This directory contains two sets of benchmarks for vLLM.

- Performance benchmark: benchmarks vLLM's performance under various workloads, for **developers** to gain clarity on whether their PR improves or degrades vLLM's performance.
- Nightly benchmark: compares vLLM's performance against alternatives (TGI, TRT-LLM and LMDeploy), for **the public** to know when to choose vLLM.

See the [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and the [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.

## Performance benchmark quick overview

**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on A100 (support for FP8 benchmarks on H100 is coming!), with different models.

@@ -19,7 +17,6 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan

**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hour so that it won't take forever to run.

## Nightly benchmark quick overview

**Benchmarking Coverage**: fixed-QPS serving on A100 (support for FP8 benchmarks on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.

@@ -28,8 +25,6 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan

**Benchmarking Duration**: about 3.5hrs.

## Trigger the benchmark

Performance benchmark will be triggered when:

@@ -39,16 +34,11 @@ Performance benchmark will be triggered when:

Nightly benchmark will be triggered when:

- Every commit for PRs with both the `perf-benchmarks` and `nightly-benchmarks` labels.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.

### Latency test

Here is an example of one test inside `latency-tests.json`:

@@ -68,6 +58,7 @@ Here is an example of one test inside `latency-tests.json`:

```

In this example:

- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments to be used for `benchmark_latency.py`. Note that you should use underscores `_` instead of dashes `-` when specifying the command line arguments; `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.
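
For illustration, a `latency-tests.json` entry consistent with the arguments described above might look like the sketch below. This is an assumption based on the bullet points, not a copy of the actual file; the `test_name` value is a hypothetical example.

```json
[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]
```

Note how every key uses underscores; `run-performance-benchmarks.sh` converts them to dashes before invoking `benchmark_latency.py`.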

@@ -75,16 +66,17 @@ Note that the performance numbers are highly sensitive to the value of the param

WARNING: The benchmarking script will save json results by itself, so please do not configure the `--output-json` parameter in the json file.

### Throughput test

The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be fed forward to `benchmark_throughput.py`.

The numbers from this test are also stable -- but a slight change in the parameter values might change the performance numbers by a lot.
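
Similarly, a `throughput-tests.json` entry could take a shape like the one below. This is a hedged sketch: the `test_name` prefix, the dataset path, and the exact parameter names accepted by `benchmark_throughput.py` are illustrative assumptions rather than verified values.

```json
[
    {
        "test_name": "throughput_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    }
]
```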

### Serving test

We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
    {
        "test_name": "serving_llama8B_tp1_sharegpt",

@@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t

```

Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` attribute includes the command line arguments for the vLLM server.
- The `client-parameters` attribute includes the command line arguments for `benchmark_serving.py`.
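
To make the structure concrete, a complete `serving-tests.json` entry presumably looks something like the sketch below; the specific server and client parameters shown are illustrative assumptions, not the exact contents of the file.

```json
[
    {
        "test_name": "serving_llama8B_tp1_sharegpt",
        "server-parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy"
        },
        "client-parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "dataset_name": "sharegpt",
            "num_prompts": 200,
            "request_rate": "inf"
        }
    }
]
```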

@@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma

WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.

### Visualizing the results

The `convert-results-json-to-markdown.py` script helps you put the benchmarking results into a markdown table, by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the results presented as a table on the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking run.

## Nightly test details

See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workload, models and docker containers used for benchmarking other LLM engines.

### Workflow

- The [nightly-pipeline.yaml](nightly-pipeline.yaml) file specifies the docker containers for the different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which probes the serving engine of the current container.
- `run-nightly-suite.sh` redirects the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- Finally, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to buildkite.

### Nightly tests

In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for the benchmarking commands, together with the benchmarking test cases. The format is very similar to the performance benchmark.

### Docker containers

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.

WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.

WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).

@@ -9,14 +9,14 @@ This file contains the downloading link for benchmarking results.

Please download the visualization scripts in the post

## Results reproduction

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```console
export HF_TOKEN=<your HF token>
apt update
apt install -y git

@@ -25,4 +25,3 @@ VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-b
```

And the results will be inside `./benchmarks/results`.

@@ -2,6 +2,7 @@

# Nightly benchmark

This benchmark aims to:

- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance for which workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker image by following the reproducing instructions.

@@ -9,7 +10,6 @@ Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html)

Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)

## Setup

- Docker images:

@@ -33,7 +33,7 @@ Latest reproduction guide: [github issue link](https://github.com/vllm-project/

- Queries are randomly sampled, and arrival patterns are determined via a Poisson process, but all with a fixed random seed.
- Evaluation metrics: Throughput (the higher the better), TTFT (time to the first token, the lower the better), ITL (inter-token latency, the lower the better).

## Known issues

- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support the `ignore-eos` flag.

@@ -7,10 +7,8 @@

- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).

{latency_tests_markdown_table}

## Throughput tests

- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).

@@ -19,10 +17,8 @@

- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.

{throughput_tests_markdown_table}

## Serving tests

- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).

@@ -33,10 +29,8 @@

- We also added a speculative decoding test for llama-3 70B, under QPS 2
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).

{serving_tests_markdown_table}

## json version of the benchmarking tables

This section contains the data of the markdown tables above in JSON format.

@@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```

The json string for all benchmarking tables:

```json
{benchmarking_results_in_json_string}
```
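
Based on the evaluation metrics listed above and the `pd.DataFrame.from_dict(benchmarking_results["serving"])` call shown in this file, the JSON string is presumably keyed by benchmark category ("latency", "throughput", "serving"). The sketch below is only an assumption about its shape; the test names and column keys are hypothetical placeholders.

```json
{
    "latency": [
        {"test_name": "latency_llama8B_tp1", "mean_latency_ms": 0.0, "median_latency_ms": 0.0, "p99_latency_ms": 0.0}
    ],
    "throughput": [
        {"test_name": "throughput_llama8B_tp1", "throughput_tok_s": 0.0}
    ],
    "serving": [
        {"test_name": "serving_llama8B_tp1_sharegpt", "mean_ttft_ms": 0.0, "p99_ttft_ms": 0.0, "mean_itl_ms": 0.0}
    ]
}
```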

You can also check the raw experiment data in the Artifact tab of the Buildkite page.

.github/PULL_REQUEST_TEMPLATE.md (3 changes, vendored)

@@ -2,4 +2,5 @@ FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (*link existing issues this PR will resolve*)

<!--- pyml disable-next-line no-emphasis-as-heading -->
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>**

@@ -33,7 +33,7 @@ repos:

  rev: v0.9.27
  hooks:
  - id: pymarkdown
    args: [fix]
- repo: https://github.com/rhysd/actionlint
  rev: v1.7.7
  hooks:

@@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by

For answers to common questions about this code of conduct, see the
[Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
[Contributor Covenant translations](https://www.contributor-covenant.org/translations).

README.md (14 changes)

@@ -16,6 +16,7 @@ Easy, fast, and cheap LLM serving for everyone

---

*Latest News* 🔥

- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and from the Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!

@@ -33,7 +34,9 @@ Easy, fast, and cheap LLM serving for everyone

- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).

---

## About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

@@ -127,6 +130,7 @@ We also have an official fundraising venue through [OpenCollective](https://open

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},

@@ -138,11 +142,11 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs

## Contact Us

- For technical questions and feature requests, please use GitHub issues or discussions.
- For discussing with fellow users and coordinating contributions and development, please use Slack.
- For security disclosures, please use GitHub's security advisory feature.
- For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.

## Media Kit

- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).

@@ -3,6 +3,7 @@

## Downloading the ShareGPT dataset

You can download the dataset by running:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

@@ -11,6 +12,7 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r

The json file refers to several image datasets (coco, llava, etc.). The benchmark scripts
will ignore a datapoint if the referred image is missing.

```bash
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
mkdir coco -p

@@ -1,6 +1,7 @@

# CUTLASS Epilogues

## Introduction

This document describes the various CUTLASS epilogues implemented for fusing de-quantization operations onto GEMMs.

Currently, we only support symmetric quantization for weights,

@@ -8,10 +9,11 @@ and symmetric and asymmetric quantization for activations.

Both can be quantized per-tensor or per-channel (weights) / per-token (activations).

There are 4 epilogues:

1. `ScaledEpilogue`: symmetric quantization for activations, no bias.
1. `ScaledEpilogueBias`: symmetric quantization for activations, supports bias.
1. `ScaledEpilogueAzp`: asymmetric per-tensor quantization for activations, supports bias.
1. `ScaledEpilogueAzpPerToken`: asymmetric per-token quantization for activations, supports bias.

We do not have epilogues for asymmetric quantization of activations without bias in order to reduce final binary size.
Instead, if no bias is passed, the epilogue will use 0 as the bias.

@@ -26,12 +28,15 @@ If $` \widehat X `$ is the quantized $` X `$, our matrices become the following

```math
A = s_a (\widehat A - J_a z_a)
```

```math
B = s_b \widehat B
```

```math
D = A B + C
```

```math
D = s_a s_b \widehat D + C
```

@@ -48,9 +53,11 @@ Expanding further, we can calculate $` \widehat D `$ as follows:

```math
A B = s_a ( \widehat A - J_a z_a ) s_b \widehat B
```

```math
A B = s_a s_b \left( \widehat A \widehat B - J_a z_a \widehat B \right)
```

```math
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
```

@@ -61,16 +68,19 @@ Each row of it is equal to $` \mathbf 1 \widehat B `$, which is a row-vector of

## Epilogues

### `ScaledEpilogue`

This epilogue computes the symmetric quantization for activations without bias, meaning $` C = 0 `$ and $` z_a = 0 `$.
The output of the GEMM is:

```math
\widehat D = \widehat A \widehat B
```

```math
D = s_a s_b \widehat D
```

```math
D = s_a s_b \widehat A \widehat B
```

@@ -79,36 +89,42 @@ Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).

### `ScaledEpilogueBias`

This epilogue computes the symmetric quantization for activations with bias, meaning $` z_a = 0 `$.
The output of the GEMM is:

```math
\widehat D = \widehat A \widehat B
```

```math
D = s_a s_b \widehat D + C
```

```math
D = s_a s_b \widehat A \widehat B + C
```

Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
- `bias` is the bias, and is always per-channel (row-vector).

### `ScaledEpilogueAzp`

This epilogue computes the asymmetric per-tensor quantization for activations with bias.
The output of the GEMM is:

```math
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
```

```math
D = s_a s_b \widehat D + C
```

```math
D = s_a s_b \left( \widehat A \widehat B - z_a J_a \widehat B \right) + C
```

@@ -117,6 +133,7 @@ Because $` z_a `$ is a scalar, the zero-point term $` z_a J_a \widehat B `$ has

That is precomputed and stored in `azp_with_adj` as a row-vector.
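
Concretely, assuming the naming follows the description above, the precomputed row-vector is

```math
\mathrm{azp\_with\_adj} = z_a \left( \mathbf 1 \widehat B \right)
```

so every row of the zero-point term $` z_a J_a \widehat B `$ equals this vector.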

Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
  - Generally this will be per-tensor as the zero-points are per-tensor.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).

@@ -125,13 +142,15 @@ Epilogue parameters:

To use these kernels efficiently, users must precompute the `azp_with_adj` term offline and pass it to the kernel.

### `ScaledEpilogueAzpPerToken`

This epilogue computes the asymmetric per-token quantization for activations with bias.

The output of the GEMM is the same as above, but the $` z_a `$ is a column-vector.
That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product of $` z_a `$ and $` \mathbf 1 \widehat B `$.
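
Written out, under this reading the per-token zero-point term and the resulting GEMM output are

```math
z_a J_a \widehat B = z_a \otimes \left( \mathbf 1 \widehat B \right)
```

```math
D = s_a s_b \left( \widehat A \widehat B - z_a \otimes \left( \mathbf 1 \widehat B \right) \right) + C
```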

Epilogue parameters:

- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
  - Generally this will be per-token as the zero-points are per-token.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).

@@ -142,6 +161,7 @@ Epilogue parameters:

To use these kernels efficiently, users must precompute the `azp_adj` term offline and pass it to the kernel.

The epilogue performs the following computation (where `Dq` is the raw quantized output of the GEMM):

```math
out = scale_a * scale_b * (Dq - azp_adj * azp) + bias
```

@@ -6,7 +6,7 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a

Machete effectively performs

```python
scale_type = w_s.dtype
compute_type = a.dtype
out = (w_q.to(scale_type) * w_s - w_z.to(scale_type)) @ a

@@ -24,7 +24,7 @@ applied.

The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like:

```python
from vllm import _custom_ops as ops

...
|
|||||||
|
|
||||||
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
|
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
|
||||||
|
|
||||||
<!--- pyml disable-num-lines 5 ul-indent-->
|
|
||||||
:::{tip}
|
:::{tip}
|
||||||
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
|
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
|
||||||
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
|
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
|
||||||
|

@@ -23,20 +23,19 @@ We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` e

- Install the token on your machine (Run `huggingface-cli login`).
- Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.

## Example 1: Running with a local file

### Step 1: Create your batch file

To follow along with this example, you can download the example batch, or create your own batch file in your working directory.

```console
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl
```

Once you've created your batch file it should look like this

```console
$ cat offline_inference/openai/openai_example_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}

@@ -48,7 +47,7 @@ The batch running tool is designed to be used from the command line.

You can run the batch with the following command, which will write its results to a file called `results.jsonl`

```console
python -m vllm.entrypoints.openai.run_batch -i offline_inference/openai/openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
```

@@ -56,7 +55,7 @@ python -m vllm.entrypoints.openai.run_batch -i offline_inference/openai/openai_e

You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`

```console
$ cat results.jsonl
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null}

@@ -68,7 +67,7 @@ The batch runner supports remote input and output urls that are accessible via h

For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl`, you can run

```console
python -m vllm.entrypoints.openai.run_batch -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl -o results.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
```

@@ -89,13 +88,13 @@ To integrate with cloud blob storage, we recommend using presigned urls.

To follow along with this example, you can download the example batch, or create your own batch file in your working directory.

```console
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai/openai_example_batch.jsonl
```

Once you've created your batch file it should look like this

```console
$ cat offline_inference/openai/openai_example_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}

@@ -103,7 +102,7 @@ $ cat offline_inference/openai/openai_example_batch.jsonl

Now upload your batch file to your S3 bucket.

```console
aws s3 cp offline_inference/openai/openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl
```

@@ -111,9 +110,9 @@ aws s3 cp offline_inference/openai/openai_example_batch.jsonl s3://MY_BUCKET/MY_

Presigned urls can only be generated via the SDK. You can run the following python script to generate your presigned urls. Be sure to replace the `MY_BUCKET`, `MY_INPUT_FILE.jsonl`, and `MY_OUTPUT_FILE.jsonl` placeholders with your bucket and file names.

(The script is adapted from <https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/python/example_code/s3/s3_basics/presigned_url.py>)

```python
import boto3
from botocore.exceptions import ClientError

@@ -149,7 +148,7 @@ print(f"{output_url=}")

This script should output

```text
input_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091'
output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091'
```

@@ -158,7 +157,7 @@ output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AW

You can now run the batch runner, using the urls generated in the previous section.

```console
python -m vllm.entrypoints.openai.run_batch \
    -i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
    -o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \

@@ -169,7 +168,7 @@ python -m vllm.entrypoints.openai.run_batch \

Your results are now on S3. You can view them in your terminal by running

```console
aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -
```

@@ -183,7 +182,7 @@ aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -

Add embedding requests to your batch file. The following is an example:

```text
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are a helpful assistant."}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"model": "intfloat/e5-mistral-7b-instruct", "input": "You are an unhelpful assistant."}}
```

@@ -198,7 +197,7 @@ You can run the batch using the same command as in earlier examples.

You can check your results by running `cat results.jsonl`

```console
$ cat results.jsonl
{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
...

@@ -214,7 +213,7 @@ $ cat results.jsonl

Add score requests to your batch file. The following is an example:

```text
{"custom_id": "request-1", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/score", "body": {"model": "BAAI/bge-reranker-v2-m3", "text_1": "What is the capital of France?", "text_2": ["The capital of Brazil is Brasilia.", "The capital of France is Paris."]}}
```

@@ -229,7 +228,7 @@ You can run the batch using the same command as in earlier examples.

You can check your results by running `cat results.jsonl`

```console
$ cat results.jsonl
{"id":"vllm-f87c5c4539184f618e555744a2965987","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-806ab64512e44071b37d3f7ccd291413","body":{"id":"score-4ee45236897b4d29907d49b01298cdb1","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.0010900497436523438},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
{"id":"vllm-41990c51a26d4fac8419077f12871099","custom_id":"request-2","response":{"status_code":200,"request_id":"vllm-batch-73ce66379026482699f81974e14e1e99","body":{"id":"score-13f2ffe6ba40460fbf9f7f00ad667d75","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.001094818115234375},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
```
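
If `jq` is installed, the scores can be pulled out of the JSONL directly (a quick sketch based on the result format shown above; `jq` itself is not part of this example):

```console
$ jq '.response.body.data[].score' results.jsonl
0.0010900497436523438
1.0
0.001094818115234375
1.0
```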

@@ -29,7 +29,6 @@ python3 profiling.py \
    --profile-result-dir profiles
```

### Generate Decode Trace

This example runs Llama 3.1 70B with a batch of 32 requests, where each has 1 input token and 128 output tokens. This is set up in an attempt to profile just the 32 decodes running in parallel, by using an extremely small prefill of 1 token and setting `VLLM_TPU_PROFILE_DELAY_MS=1000` to skip the first second of inference (hopefully the prefill).
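
In other words, the profiler is told to wait before it starts capturing; a sketch of that part of the invocation (the other flags are elided here, since they belong to the command this section documents):

```bash
# Start the trace ~1s after inference begins so the 1-token prefill is skipped
# and the capture contains (mostly) the 32 parallel decodes.
VLLM_TPU_PROFILE_DELAY_MS=1000 python3 profiling.py ...
```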

@@ -51,17 +50,18 @@ python3 profiling.py \
    --max-model-len 2048 --tensor-parallel-size 8
```

## Visualizing the profiles

Once you have collected your profiles with this script, you can visualize them using [TensorBoard](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm).

These are the dependencies you will most likely need to install:

```bash
pip install tensorflow-cpu tensorboard-plugin-profile etils importlib_resources
```

Then you just need to point TensorBoard to the directory where you saved the profiles and visit `http://localhost:6006/` in your browser:

```bash
tensorboard --logdir profiles/ --port 6006
```
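
If TensorBoard is running on a remote machine (for example the TPU VM itself), forwarding the port makes the UI reachable from a local browser. This is a generic SSH sketch, not a step from the original guide; the host name is a placeholder:

```bash
# Forward local port 6006 to TensorBoard on the remote host, then open http://localhost:6006/
ssh -L 6006:localhost:6006 <remote-host>
```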

@@ -1,7 +1,8 @@
# Setup OpenTelemetry POC

1. Install OpenTelemetry packages:

```console
pip install \
  'opentelemetry-sdk>=1.26.0,<1.27.0' \
  'opentelemetry-api>=1.26.0,<1.27.0' \
@@ -10,7 +11,8 @@
```

1. Start Jaeger in a docker container:

```console
# From: https://www.jaegertracing.io/docs/1.57/getting-started/
docker run --rm --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
@@ -28,19 +30,23 @@
```

1. In a new shell, export Jaeger IP:

```console
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
```

Then set vLLM's service name for OpenTelemetry, enable insecure connections to Jaeger and run vLLM:

```console
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

1. In a new shell, send requests with trace context from a dummy client:

```console
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
@@ -48,7 +54,7 @@
python dummy_client.py
```

1. Open the Jaeger web UI: <http://localhost:16686/>

In the search pane, select the `vllm-server` service and hit `Find Traces`. You should get a list of traces, one for each request.

![Traces](https://i.imgur.com/GYHhFjo.png)

@@ -57,23 +63,29 @@
![Spans details](https://i.imgur.com/OPf6CBL.png)

## Exporter Protocol

OpenTelemetry supports either `grpc` or `http/protobuf` as the transport protocol for trace data in the exporter.
By default, `grpc` is used. To set `http/protobuf` as the protocol, configure the `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` environment variable as follows:

```console
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```

## Instrumentation of FastAPI

OpenTelemetry allows automatic instrumentation of FastAPI.

1. Install the instrumentation library:

```console
pip install opentelemetry-instrumentation-fastapi
```

1. Run vLLM with `opentelemetry-instrument`:

```console
opentelemetry-instrument vllm serve facebook/opt-125m
```
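
To confirm the FastAPI instrumentation is active, you can re-send a request with the dummy client from the earlier steps and look for the extra HTTP-layer spans in Jaeger (a quick sanity check, not an additional step from the original guide):

```console
python dummy_client.py
```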

@@ -3,12 +3,14 @@

This is a simple example that shows you how to connect vLLM metric logging to the Prometheus/Grafana stack. For this example, we launch Prometheus and Grafana via Docker. You can check out other methods through the [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/) websites.

Install:

- [`docker`](https://docs.docker.com/engine/install/)
- [`docker compose`](https://docs.docker.com/compose/install/linux/#install-using-the-repository)

## Launch

Prometheus metric logging is enabled by default in the OpenAI-compatible server. Launch via the entrypoint:

```bash
vllm serve mistralai/Mistral-7B-v0.1 \
    --max-model-len 2048 \
@@ -16,11 +18,13 @@ vllm serve mistralai/Mistral-7B-v0.1 \
```

Launch Prometheus and Grafana servers with `docker compose`:

```bash
docker compose up
```
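
Before wiring Prometheus up, it can be handy to confirm that vLLM is actually exposing metrics. A quick check, assuming the server from the launch step is listening on the default port 8000:

```bash
# vLLM serves Prometheus metrics on the same port as the OpenAI-compatible API
curl -s http://localhost:8000/metrics | head
```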

Submit some sample requests to the server:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

@@ -15,7 +15,6 @@ more-complex-and-more-flexible.

- Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and
  set `VLLM_LOGGING_CONFIG_PATH=<path-to-logging-config.json>`

## Logging Configuration Environment Variables

### `VLLM_CONFIGURE_LOGGING`

@@ -45,7 +44,6 @@ schema](https://docs.python.org/3/library/logging.config.html#dictionary-schema-

If `VLLM_LOGGING_CONFIG_PATH` is specified, but `VLLM_CONFIGURE_LOGGING` is
disabled, an error will occur while starting vLLM.

## Examples

### Example 1: Customize vLLM root logger

@@ -98,7 +96,6 @@ VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```

### Example 2: Silence a particular vLLM logger

To silence a particular vLLM logger, it is necessary to provide custom logging

@@ -153,7 +150,6 @@ VLLM_LOGGING_CONFIG_PATH=/path/to/logging_config.json \
vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```

### Example 3: Disable vLLM default logging configuration

To disable vLLM's default logging configuration and silence all vLLM loggers,

@@ -166,7 +162,6 @@ VLLM_CONFIGURE_LOGGING=0 \
vllm serve mistralai/Mistral-7B-v0.1 --max-model-len 2048
```

## Additional resources

- [`logging.config` Dictionary Schema Details](https://docs.python.org/3/library/logging.config.html#dictionary-schema-details)

@@ -27,4 +27,3 @@ The example usage is in [this file](../../../examples/online_serving/disaggregat

Here is the diagram of how we run disaggregated prefilling.

![Disaggregated prefill workflow](../source/assets/design/disagg_prefill/workflow.jpg)