## Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99); a sketch of how these statistics can be computed follows the table.
{latency_tests_markdown_table}
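For reference, the latency statistics above can be reproduced from raw per-iteration measurements as in the following sketch (the values and variable names are illustrative, not part of the benchmark scripts):

```python
import numpy as np

# Hypothetical per-iteration end-to-end latencies in seconds
# (batch size 8, 32 input tokens, 128 output tokens per request).
latencies = np.array([2.31, 2.28, 2.35, 2.40, 2.29])

summary = {
    "mean_latency_s": float(np.mean(latencies)),
    "median_latency_s": float(np.median(latencies)),
    "p99_latency_s": float(np.percentile(latencies, 99)),
}
print(summary)
```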
## Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed); see the sampling sketch after the table below.
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
{throughput_tests_markdown_table}
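A minimal sketch of the seeded prompt sampling and the throughput metric described above (the function names are illustrative, not part of the vLLM benchmark scripts):

```python
import random

def sample_prompts(dataset: list, num_prompts: int = 200, seed: int = 0) -> list:
    """Sample a fixed set of prompts; the seed keeps the selection stable across runs."""
    rng = random.Random(seed)
    return rng.sample(dataset, num_prompts)

# Throughput is reported as generated tokens per second of wall-clock time.
def throughput(total_output_tokens: int, elapsed_seconds: float) -> float:
    return total_output_tokens / elapsed_seconds
```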
## Serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the corresponding output lengths of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- **Average QPS (queries per second)**: 1, 4, 16, and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch after the table below.
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B under QPS 2.
- Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), and ITL (inter-token latency; mean, median, and p99).
{serving_tests_markdown_table}
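The Poisson arrival pattern mentioned above can be sketched as follows (illustrative only; the actual serving benchmark script may implement it differently):

```python
import numpy as np

def arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return request arrival times (seconds) for a given average QPS.

    qps == float("inf") means all requests are issued at time 0; otherwise
    inter-arrival gaps are drawn from an exponential distribution with
    rate `qps`, i.e. a Poisson arrival process.
    """
    if qps == float("inf"):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: arrival times for 200 requests at an average of 4 QPS.
print(arrival_times(200, 4.0)[:5])
```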
## JSON version of the benchmarking tables
This section contains the data from the markdown tables above in JSON format.
You can load the benchmarking tables into pandas DataFrames as follows:
```python
import json
import pandas as pd

# The benchmarking results are embedded below as a JSON string.
benchmarking_results_json = """The json string"""
benchmarking_results = json.loads(benchmarking_results_json)

# Each top-level key corresponds to one of the markdown tables above.
latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
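Once loaded, the DataFrames can be inspected as usual; the exact columns come from the generated tables above, so none are hard-coded here:

```python
# Column names and rows depend on the generated JSON above.
print(latency_results.columns)
print(serving_results.head())
```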
The JSON string for all benchmarking tables:
```json
{benchmarking_results_in_json_string}
```
You can also check the raw experiment data in the Artifact tab of the Buildkite page.