# Benchmarking vLLM

This README guides you through running benchmark tests with the extensive
datasets supported on vLLM. It’s a living document, updated as new features and datasets
become available.

## Dataset Overview

<table style="width:100%; border-collapse: collapse;">
  <thead>
    <tr>
      <th style="width:15%; text-align: left;">Dataset</th>
      <th style="width:10%; text-align: center;">Online</th>
      <th style="width:10%; text-align: center;">Offline</th>
      <th style="width:65%; text-align: left;">Data Path</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>ShareGPT</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td><code>wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json</code></td>
    </tr>
    <tr>
      <td><strong>BurstGPT</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td><code>wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv</code></td>
    </tr>
    <tr>
      <td><strong>Sonnet</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td>Local file: <code>benchmarks/sonnet.txt</code></td>
    </tr>
    <tr>
      <td><strong>Random</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td><code>synthetic</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-VisionArena</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td><code>lmarena-ai/VisionArena-Chat</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-InstructCoder</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td><code>likaixin/InstructCoder</code></td>
    </tr>
    <tr>
      <td><strong>HuggingFace-Other</strong></td>
      <td style="text-align: center;">✅</td>
      <td style="text-align: center;">✅</td>
      <td><code>lmms-lab/LLaVA-OneVision-Data</code>, <code>Aeala/ShareGPT_Vicuna_unfiltered</code></td>
    </tr>
  </tbody>
</table>

✅: Supported

🟡: Partially supported

🚧: To be supported

**Note**: For HuggingFace datasets, set `--dataset-name` to `hf`.

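As a rough illustration of the note above, the sketch below guesses a `--dataset-name` value from a data path. The function name `dataset_name_for` and the extension-based heuristic are assumptions made for this example, not part of the benchmark scripts, and only the dataset names used in this README's examples are covered.

```bash
# Hypothetical helper: guess --dataset-name from the data path.
# Heuristic (an assumption for illustration only):
#   *.json -> sharegpt, *.txt -> sonnet, org/name with no extension -> hf
dataset_name_for() {
  case "$1" in
    *.json) echo sharegpt ;;
    *.txt)  echo sonnet ;;
    */*)    echo hf ;;
    *)
      echo "cannot guess dataset name for: $1" >&2
      return 1
      ;;
  esac
}

dataset_name_for lmarena-ai/VisionArena-Chat
dataset_name_for vllm/benchmarks/sonnet.txt
```

The two calls print `hf` and `sonnet`. Passing the flag explicitly, as the commands in this README do, remains the more reliable choice.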

---

## Example - Online Benchmark

First, start serving your model:

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```
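
The server needs some time to load the model before it accepts requests, so a script that launches both steps may want to wait in between. The following is a minimal sketch of a retry loop; `wait_until` is a hypothetical helper, and the probe URL in the commented-out line assumes the OpenAI-compatible server is listening on `localhost:8000`, which may not match your setup.

```bash
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe quietly; stop as soon as it exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Self-check with a probe that always succeeds.
wait_until 3 0 true && echo "probe succeeded"

# Assumed real-world probe (adjust host and port to your server):
# wait_until 60 5 curl -sf http://localhost:8000/v1/models && echo "server is ready"
```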

Then run the benchmarking script:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10
```

If successful, you will see the following output:

```
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```
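
To read the throughput rows, it can help to recompute them from the counts and the duration. The sketch below assumes each throughput figure is simply a count divided by the benchmark duration, which is what the sample numbers suggest; `rate` is a hypothetical helper name.

```bash
# Hypothetical helper: count divided by duration, rounded to two decimals.
rate() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Values copied from the sample output above.
duration_s=5.78
rate 10 "$duration_s"                # requests per second
rate 2212 "$duration_s"              # generated tokens per second
rate $((1369 + 2212)) "$duration_s"  # input + generated tokens per second
```

This prints `1.73`, `382.70`, and `619.55`. The small gap to the reported `382.89` and `619.85` is consistent with the displayed duration being rounded to two decimals.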

### VisionArena Benchmark for Vision Language Models

```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

```bash
python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 1000
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --speculative-model "[ngram]" \
  --ngram-prompt-lookup-min 2 \
  --ngram-prompt-lookup-max 5 \
  --num-speculative-tokens 5
```

```bash
python3 vllm/benchmarks/benchmark_serving.py \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name hf \
  --dataset-path likaixin/InstructCoder \
  --num-prompts 2048
```

### Other HuggingFaceDataset Examples

```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

**`lmms-lab/LLaVA-OneVision-Data`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

**`Aeala/ShareGPT_Vicuna_unfiltered`**

```bash
python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

---

## Example - Offline Throughput Benchmark

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sonnet \
  --dataset-path vllm/benchmarks/sonnet.txt \
  --num-prompts 10
```

If successful, you will see the following output:

```
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens: 5014
Total num output tokens: 1500
```
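
The three output lines are easy to post-process when comparing runs. As a sketch, the snippet below parses them with `awk` and estimates the elapsed time in two ways, from total tokens and from output tokens. `summarize_throughput` is a hypothetical helper, and the field positions assume the exact line layout shown above.

```bash
# Hypothetical helper: read benchmark_throughput.py output on stdin and
# print the elapsed time (seconds) estimated from two different rates.
summarize_throughput() {
  awk '
    /^Throughput:/              { total_rate = $4; output_rate = $7 }
    /^Total num prompt tokens:/ { prompt = $5 }
    /^Total num output tokens:/ { output = $5 }
    END {
      # Tokens divided by tokens-per-second gives seconds.
      printf "%.2f %.2f\n", (prompt + output) / total_rate, output / output_rate
    }'
}

# Sample output copied from above.
summarize_throughput <<'EOF'
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens: 5014
Total num output tokens: 1500
EOF
```

Both estimates come out at `1.40` seconds for the sample, which suggests the two rates share the same elapsed time.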

### VisionArena Benchmark for Vision Language Models

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --hf-split train
```

The `num prompt tokens` count now includes image tokens:

```
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens: 14527
Total num output tokens: 1280
```

### InstructCoder Benchmark with Speculative Decoding

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
  --dataset-name=hf \
  --dataset-path=likaixin/InstructCoder \
  --model=meta-llama/Meta-Llama-3-8B-Instruct \
  --input-len=1000 \
  --output-len=100 \
  --num-prompts=2048 \
  --async-engine \
  --speculative-model="[ngram]" \
  --ngram-prompt-lookup-min=2 \
  --ngram-prompt-lookup-max=5 \
  --num-speculative-tokens=5
```

```
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens: 261136
Total num output tokens: 204800
```
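
A quick sanity check on these totals is to divide them by the number of prompts. The sketch below does that with a hypothetical `per_prompt` helper, using the counts from the sample output and the `--num-prompts=2048` value from the command.

```bash
# Hypothetical helper: average tokens per prompt, to one decimal place.
per_prompt() {
  awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t / n }'
}

per_prompt 204800 2048   # output tokens per prompt
per_prompt 261136 2048   # prompt tokens per prompt
```

This prints `100.0` and `127.5`. The first figure matches `--output-len=100`, so every request appears to have generated its full output budget.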

### Other HuggingFaceDataset Examples

**`lmms-lab/LLaVA-OneVision-Data`**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10
```

**`Aeala/ShareGPT_Vicuna_unfiltered`**

```bash
python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10
```

### Benchmark with LoRA Adapters

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset-name sharegpt \
  --num-prompts 10 \
  --max-loras 2 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path yard1/llama-2-7b-sql-lora-test
```