20231088/vllm


update benchmark_serving_structured_output to include auto backend (#16438 )

Signed-off-by: Chenyaaang <chenyangli@google.com>

2025-04-11 12:25:52 +08:00

cutlass_benchmarks

Update deprecated Python 3.8 typing (#13971 )

2025-03-02 17:34:51 -08:00

disagg_benchmarks

[Misc] Add SPDX-License-Identifier headers to python source files (#12628 )

2025-02-02 11:58:18 -08:00

fused_kernels

Update deprecated Python 3.8 typing (#13971 )

2025-03-02 17:34:51 -08:00

kernels

Upstream Llama4 Support to Main (#16113 )

2025-04-07 08:06:27 -07:00

overheads

[Misc] Add SPDX-License-Identifier headers to python source files (#12628 )

2025-02-02 11:58:18 -08:00

structured_schemas

benchmarks: simplify test jsonschema (#14567 )

2025-03-11 13:39:30 +00:00

backend_request_func.py

[Benchmark] Add sampling parameters to benchmark_serving. (#16022 )

2025-04-06 12:30:35 +08:00

benchmark_dataset.py

check input length of sonnet samples (#16423 )

2025-04-11 10:15:06 +08:00

benchmark_latency.py

[Benchmarks] Make detokenization optional in benchmark scripts (#11697 )

2025-03-07 08:09:00 -08:00

benchmark_long_document_qa_throughput.py

[Misc] Add SPDX-License-Identifier headers to python source files (#12628 )

2025-02-02 11:58:18 -08:00

benchmark_prefix_caching.py

[Benchmarks] Make detokenization optional in benchmark scripts (#11697 )

2025-03-07 08:09:00 -08:00

benchmark_prioritization.py

[Benchmarks] Make detokenization optional in benchmark scripts (#11697 )

2025-03-07 08:09:00 -08:00

benchmark_serving_structured_output.py

update benchmark_serving_structured_output to include auto backend (#16438 )

2025-04-11 12:25:52 +08:00

benchmark_serving.py

Fix range_ratio Bug in RandomDataset (#16126 )

2025-04-10 15:31:17 -07:00

benchmark_throughput.py

Fix range_ratio Bug in RandomDataset (#16126 )

2025-04-10 15:31:17 -07:00

benchmark_utils.py

Update deprecated Python 3.8 typing (#13971 )

2025-03-02 17:34:51 -08:00

README.md

[Benchmark] Add sampling parameters to benchmark_serving. (#16022 )

2025-04-06 12:30:35 +08:00

run_structured_output_benchmark.sh

[Misc][Benchmark] Add support for different tokenizer_mode (#15040 )

2025-03-19 14:56:50 +00:00

sonnet.txt

feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277 )

2024-03-27 13:39:26 -07:00

README.md

Benchmarking vLLM

This README shows how to run vLLM benchmarks against the supported datasets. It is a living document, updated as new features and datasets become available.

Dataset Overview

| Dataset | Online | Offline | Data Path |
|---|---|---|---|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| HuggingFace-VisionArena | ✅ | ✅ | `lmarena-ai/VisionArena-Chat` |
| HuggingFace-InstructCoder | ✅ | ✅ | `likaixin/InstructCoder` |
| HuggingFace-AIMO | ✅ | ✅ | `AI-MO/aimo-validation-aime`, `AI-MO/NuminaMath-1.5`, `AI-MO/NuminaMath-CoT` |
| HuggingFace-Other | ✅ | ✅ | `lmms-lab/LLaVA-OneVision-Data`, `Aeala/ShareGPT_Vicuna_unfiltered` |

✅: supported

🟡: partial support

🚧: to be supported

Note: for HuggingFace datasets, set `--dataset-name` to `hf` and pass the dataset ID (for example `lmarena-ai/VisionArena-Chat`) through `--dataset-path`.
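The mapping from a HuggingFace dataset to command-line flags is regular enough to sketch in a few lines. The helper below is hypothetical (`build_dataset_args` is not part of vLLM); it only assembles flags that appear elsewhere in this README, and is meant as an illustration rather than a supported interface.

```python
from typing import Optional


def build_dataset_args(
    dataset_path: str,
    hf_split: Optional[str] = None,
    hf_subset: Optional[str] = None,
) -> list[str]:
    """Return the dataset-related flags for a HuggingFace dataset.

    Hypothetical helper for illustration only; it uses no flags beyond
    the ones shown in this README.
    """
    # HuggingFace datasets always use the generic "hf" dataset name;
    # the dataset ID itself goes in --dataset-path.
    args = ["--dataset-name", "hf", "--dataset-path", dataset_path]
    if hf_split is not None:
        args += ["--hf-split", hf_split]
    if hf_subset is not None:
        args += ["--hf-subset", hf_subset]
    return args


args = build_dataset_args("lmarena-ai/VisionArena-Chat", hf_split="train")
print(" ".join(args))
```

The resulting list can be appended to the argument list of either benchmark script when launching it from Python, for example with `subprocess.run`.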

Example - Online Benchmark

First, start serving your model:

vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests

Then run the benchmarking script:

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10

If successful, you will see output similar to the following:

============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  5.78      
Total input tokens:                      1369      
Total generated tokens:                  2212      
Request throughput (req/s):              1.73      
Output token throughput (tok/s):         382.89    
Total Token throughput (tok/s):          619.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54     
Median TTFT (ms):                        73.88     
P99 TTFT (ms):                           79.49     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91      
Median TPOT (ms):                        7.96      
P99 TPOT (ms):                           8.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74      
Median ITL (ms):                         7.70      
P99 ITL (ms):                            8.39      
==================================================
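As a sanity check on a report like the one above, the throughput rows can be recomputed from the counts. This is a minimal sketch, assuming each throughput figure is the corresponding count divided by the benchmark duration; the variable names are illustrative, and the numbers are copied from the sample report.

```python
# Values copied from the sample report above.
successful_requests = 10
duration_s = 5.78  # the report rounds the duration to two decimals
total_input_tokens = 1369
total_generated_tokens = 2212

# Assumed definitions: count divided by wall-clock benchmark duration.
request_throughput = successful_requests / duration_s
output_token_throughput = total_generated_tokens / duration_s
total_token_throughput = (
    total_input_tokens + total_generated_tokens
) / duration_s

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.2f}")
```

Because the printed duration is itself rounded, the recomputed token throughputs agree with the reported 382.89 and 619.85 only to within about 0.1%.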

VisionArena Benchmark for Vision Language Models

# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests

python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 1000

InstructCoder Benchmark with Speculative Decoding

VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-model "[ngram]" \
    --ngram-prompt-lookup-min 2 \
    --ngram-prompt-lookup-max 5 \
    --num-speculative-tokens 5

python3 vllm/benchmarks/benchmark_serving.py \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name hf \
    --dataset-path likaixin/InstructCoder \
    --num-prompts 2048
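The `[ngram]` setting drafts tokens by looking for an earlier copy of the most recent tokens in the sequence and proposing whatever followed it. The sketch below illustrates that idea only; it is not vLLM's implementation, and the function name, argument names, and match order (longest suffix first, most recent occurrence first) are assumptions made for the example. The defaults mirror the lookup window of 2 to 5 tokens and the 5 speculative tokens used in the serve command above.

```python
from typing import Sequence


def propose_ngram_draft(
    tokens: Sequence[int],
    lookup_min: int = 2,
    lookup_max: int = 5,
    num_speculative_tokens: int = 5,
) -> list[int]:
    """Propose draft tokens by finding an earlier copy of the current suffix.

    Illustrative only: tries the longest suffix first and returns the
    tokens that followed its most recent earlier occurrence.
    """
    for n in range(lookup_max, lookup_min - 1, -1):
        if n >= len(tokens):
            continue  # not enough context for an n-token suffix plus a match
        suffix = list(tokens[-n:])
        # Scan earlier start positions, most recent first, excluding the
        # suffix's own position at the end of the sequence.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                end = start + n + num_speculative_tokens
                return list(tokens[start + n:end])
    return []  # no repeated n-gram: nothing to speculate


draft = propose_ngram_draft([1, 2, 3, 4, 5, 1, 2, 3])
print(draft)
```

Text that repeats itself, such as the code edits in InstructCoder, tends to produce many matches, which is why that dataset is paired with this drafting method here.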

Other HuggingFaceDataset Examples

vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests

lmms-lab/LLaVA-OneVision-Data

python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10

Aeala/ShareGPT_Vicuna_unfiltered

python3 vllm/benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10

AI-MO/aimo-validation-aime

python3 vllm/benchmarks/benchmark_serving.py \
    --model Qwen/QwQ-32B \
    --dataset-name hf \
    --dataset-path AI-MO/aimo-validation-aime \
    --num-prompts 10 \
    --seed 42

Running With Sampling Parameters

When using OpenAI-compatible backends such as vllm, optional sampling parameters can be specified. Example client command:

python3 vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --top-k 10 \
  --top-p 0.9 \
  --temperature 0.5 \
  --num-prompts 10
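For readers unfamiliar with the three parameters, the following sketch shows their effect on a toy next-token distribution. It is a simplified illustration, not vLLM's sampler: the function name is hypothetical, and real implementations may apply the filters in a different order or on renormalised probabilities.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=10, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
sharp = filter_distribution(toy_logits, temperature=0.5)
flat = filter_distribution(toy_logits, temperature=1.0)
print(sorted(sharp), sorted(flat))
```

A lower temperature concentrates probability on the top tokens, so fewer candidates survive the top-p cut: with these toy logits, two tokens remain at temperature 0.5 and three at temperature 1.0.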

Example - Offline Throughput Benchmark

python3 vllm/benchmarks/benchmark_throughput.py \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sonnet \
  --dataset-path vllm/benchmarks/sonnet.txt \
  --num-prompts 10

If successful, you will see output similar to the following:

Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
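When comparing runs, it can help to pull these numbers out of the log programmatically. The sketch below assumes the three-line output format shown above; the regular expressions and the `parse_throughput` helper are illustrative and not part of vLLM.

```python
import re

# Sample output copied from above.
sample = """\
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
"""

THROUGHPUT_RE = re.compile(
    r"Throughput: (?P<req>[\d.]+) requests/s, "
    r"(?P<total>[\d.]+) total tokens/s, "
    r"(?P<out>[\d.]+) output tokens/s"
)


def parse_throughput(text: str) -> dict[str, float]:
    """Extract rates and token counts from benchmark_throughput.py output."""
    match = THROUGHPUT_RE.search(text)
    if match is None:
        raise ValueError("no throughput line found")
    result = {name: float(value) for name, value in match.groupdict().items()}
    # The token-count lines are optional; record them when present.
    for key, label in (("prompt_tokens", "prompt"), ("output_tokens", "output")):
        count = re.search(rf"Total num {label} tokens:\s+(\d+)", text)
        if count is not None:
            result[key] = float(count.group(1))
    return result


metrics = parse_throughput(sample)
# Tokens per request implied by the rates versus the raw counts (10 prompts).
implied = metrics["total"] / metrics["req"]
counted = (metrics["prompt_tokens"] + metrics["output_tokens"]) / 10
print(f"{implied:.1f} vs {counted:.1f} tokens per request")
```

The two per-request figures should agree up to the rounding of the printed rates, which is a quick way to confirm a log was parsed correctly.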

VisionArena Benchmark for Vision Language Models

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --hf-split train

The prompt token count now includes image tokens:

Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280

InstructCoder Benchmark with Speculative Decoding

VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
    --dataset-name=hf \
    --dataset-path=likaixin/InstructCoder \
    --model=meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len=1000 \
    --output-len=100 \
    --num-prompts=2048 \
    --async-engine \
    --speculative-model="[ngram]" \
    --ngram-prompt-lookup-min=2 \
    --ngram-prompt-lookup-max=5 \
    --num-speculative-tokens=5

Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens:  261136
Total num output tokens:  204800

Other HuggingFaceDataset Examples

lmms-lab/LLaVA-OneVision-Data

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path lmms-lab/LLaVA-OneVision-Data \
  --hf-split train \
  --hf-subset "chart2text(cauldron)" \
  --num-prompts 10

Aeala/ShareGPT_Vicuna_unfiltered

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --backend vllm-chat \
  --dataset-name hf \
  --dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
  --hf-split train \
  --num-prompts 10

AI-MO/aimo-validation-aime

python3 vllm/benchmarks/benchmark_throughput.py \
  --model Qwen/QwQ-32B \
  --backend vllm \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --hf-split train \
  --num-prompts 10

Benchmark with LoRA Adapters

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --dataset-name sharegpt \
  --num-prompts 10 \
  --max-loras 2 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path yard1/llama-2-7b-sql-lora-test
