vllm/benchmarks
Chenyaaang d544d141ec
update benchmark_serving_structured_output to include auto backend (#16438)
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-04-11 12:25:52 +08:00
..

Benchmarking vLLM

This README guides you through running benchmark tests with the extensive datasets supported on vLLM. Its a living document, updated as new features and datasets become available.

Dataset Overview

Dataset Online Offline Data Path
ShareGPT wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
BurstGPT wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv
Sonnet Local file: benchmarks/sonnet.txt
Random synthetic
HuggingFace-VisionArena lmarena-ai/VisionArena-Chat
HuggingFace-InstructCoder likaixin/InstructCoder
HuggingFace-AIMO AI-MO/aimo-validation-aime , AI-MO/NuminaMath-1.5, AI-MO/NuminaMath-CoT
HuggingFace-Other lmms-lab/LLaVA-OneVision-Data, Aeala/ShareGPT_Vicuna_unfiltered

: supported

🟡: Partial support

🚧: to be supported

Note: HuggingFace dataset's dataset-name should be set to hf


Example - Online Benchmark

First start serving your model

vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests

Then run the benchmarking script

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10

If successful, you will see the following output

============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  5.78      
Total input tokens:                      1369      
Total generated tokens:                  2212      
Request throughput (req/s):              1.73      
Output token throughput (tok/s):         382.89    
Total Token throughput (tok/s):          619.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54     
Median TTFT (ms):                        73.88     
P99 TTFT (ms):                           79.49     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91      
Median TPOT (ms):                        7.96      
P99 TPOT (ms):                           8.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74      
Median ITL (ms):                         7.70      
P99 ITL (ms):                            8.39      
==================================================

VisionArena Benchmark for Vision Language Models

# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 1000

InstructCoder Benchmark with Speculative Decoding

VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-model "[ngram]" \
    --ngram_prompt_lookup_min 2 \
    --ngram-prompt-lookup-max 5 \
    --num_speculative_tokens 5
python3 benchmarks/benchmark_serving.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048

Other HuggingFaceDataset Examples

vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests

lmms-lab/LLaVA-OneVision-Data

python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10

Aeala/ShareGPT_Vicuna_unfiltered

python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10

AI-MO/aimo-validation-aime

python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42

Running With Sampling Parameters

When using OpenAI-compatible backends such as vllm, optional sampling parameters can be specified. Example client command:

python3 vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --top-k 10 \
  --top-p 0.9 \
  --temperature 0.5 \
  --num-prompts 10

Example - Offline Throughput Benchmark

python3 vllm/benchmarks/benchmark_throughput.py \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sonnet \
  --dataset-path vllm/benchmarks/sonnet.txt \
  --num-prompts 10

If successful, you will see the following output

Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500

VisionArena Benchmark for Vision Language Models

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --hf-split train

The num prompt tokens now includes image token counts

Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280

InstructCoder Benchmark with Speculative Decoding

VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-model="[ngram]" \
    --ngram_prompt_lookup_min=2 \
    --ngram-prompt-lookup-max=5 \
    --num_speculative_tokens=5
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800

Other HuggingFaceDataset Examples

lmms-lab/LLaVA-OneVision-Data

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10

Aeala/ShareGPT_Vicuna_unfiltered

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10

AI-MO/aimo-validation-aime

python3 benchmarks/benchmark_throughput.py \
  --model Qwen/QwQ-32B \
  --backend vllm \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --hf-split train \
  --num-prompts 10

Benchmark with LoRA Adapters

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset_name sharegpt \
  --num-prompts 10 \
  --max-loras 2 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path yard1/llama-2-7b-sql-lora-test