Benchmarking vLLM

This README guides you through running benchmark tests with the extensive datasets supported by vLLM. It's a living document, updated as new features and datasets become available.

Dataset Overview

Dataset      Online  Offline  Data Path
ShareGPT     ✅      ✅       wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
BurstGPT     ✅      ✅       wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv
Sonnet       ✅      ✅       Local file: benchmarks/sonnet.txt
Random       ✅      ✅       synthetic
HuggingFace  🟡      🟡       Specify your dataset path on HuggingFace
VisionArena  ✅      ✅       lmarena-ai/vision-arena-bench-v0.1 (a HuggingFace dataset)

✅: supported

🚧: to be supported

🟡: Partial support. Currently, HuggingFaceDataset only supports dataset formats similar to lmms-lab/LLaVA-OneVision-Data and Aeala/ShareGPT_Vicuna_unfiltered. If you need support for other dataset formats, please consider contributing.

Note: VisionArena's dataset-name should be set to hf.
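
For the local-file datasets, you can fetch them up front; these are the same wget commands listed in the table above:

# download ShareGPT and BurstGPT once before benchmarking
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv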


Example - Online Benchmark

First, start serving your model:

MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
vllm serve ${MODEL_NAME} --disable-log-requests
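
Model loading can take a while. Before benchmarking, you can poll the server's health endpoint and wait for an HTTP 200; this sketch assumes the server is listening on the default port 8000:

# block until the server reports healthy (assumes default port 8000)
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
done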

Then run the benchmarking script:

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
NUM_PROMPTS=10
BACKEND="vllm"
DATASET_NAME="sharegpt"
DATASET_PATH="<your data path>/ShareGPT_V3_unfiltered_cleaned_split.json"
python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}"

If successful, you will see output like the following:

============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  5.78      
Total input tokens:                      1369      
Total generated tokens:                  2212      
Request throughput (req/s):              1.73      
Output token throughput (tok/s):         382.89    
Total Token throughput (tok/s):          619.85    
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54     
Median TTFT (ms):                        73.88     
P99 TTFT (ms):                           79.49     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91      
Median TPOT (ms):                        7.96      
P99 TPOT (ms):                           8.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74      
Median ITL (ms):                         7.70      
P99 ITL (ms):                            8.39      
==================================================
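
By default the script sends all requests as fast as possible. benchmark_serving.py also accepts flags for shaping and recording a run; a sketch, assuming the --request-rate, --save-result, and --result-dir flags in your version (verify with --help):

# throttle request arrivals and persist the metrics to a JSON file
python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --request-rate 4 \
  --save-result \
  --result-dir ./benchmark_results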

VisionArena Benchmark for Vision Language Models

# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="lmarena-ai/vision-arena-bench-v0.1"
DATASET_SPLIT='train'

python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}"

HuggingFaceDataset Examples

Currently, HuggingFaceDataset only supports dataset formats similar to lmms-lab/LLaVA-OneVision-Data and Aeala/ShareGPT_Vicuna_unfiltered. If you need support for other dataset formats, please consider contributing.

# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests

lmms-lab/LLaVA-OneVision-Data

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="lmms-lab/LLaVA-OneVision-Data"
DATASET_SPLIT='train'
DATASET_SUBSET='chart2text(cauldron)'
python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}" \
  --hf-subset "${DATASET_SUBSET}"

Aeala/ShareGPT_Vicuna_unfiltered

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="Aeala/ShareGPT_Vicuna_unfiltered"
DATASET_SPLIT='train'
python3 vllm/benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}"

Example - Offline Throughput Benchmark

MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
NUM_PROMPTS=10
DATASET_NAME="sonnet"
DATASET_PATH="vllm/benchmarks/sonnet.txt"

python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}"

If successful, you will see output like the following:

Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens:  5014
Total num output tokens:  1500
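
To capture these numbers in machine-readable form, benchmark_throughput.py can write them to a JSON file; a sketch, assuming your version supports the --output-json flag (verify with --help):

# same run as above, plus a JSON results file
python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --output-json ./throughput_results.json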

VisionArena Benchmark for Vision Language Models

MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
DATASET_NAME="hf"
DATASET_PATH="lmarena-ai/vision-arena-bench-v0.1"
DATASET_SPLIT="train"

python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --backend "vllm-chat" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --hf-split "${DATASET_SPLIT}"

The prompt token count ("Total num prompt tokens") now includes image tokens:

Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens:  14527
Total num output tokens:  1280

Benchmark with LoRA Adapters

# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
MODEL_NAME="meta-llama/Llama-2-7b-hf"
BACKEND="vllm"
DATASET_NAME="sharegpt"
DATASET_PATH="<your data path>/ShareGPT_V3_unfiltered_cleaned_split.json"
NUM_PROMPTS=10
MAX_LORAS=2
MAX_LORA_RANK=8
ENABLE_LORA="--enable-lora"
LORA_PATH="yard1/llama-2-7b-sql-lora-test"

python3 vllm/benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --backend "${BACKEND}" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --max-loras "${MAX_LORAS}" \
  --max-lora-rank "${MAX_LORA_RANK}" \
  ${ENABLE_LORA} \
  --lora-path "${LORA_PATH}"
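
The example above runs LoRA offline. To serve the same adapter online, vllm serve can register it with --enable-lora and --lora-modules; a sketch, where the adapter name sql-lora is an arbitrary label of our choosing:

# serve the base model with a named LoRA adapter attached
vllm serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules sql-lora=yard1/llama-2-7b-sql-lora-test \
  --max-loras 2 \
  --max-lora-rank 8

Requests can then target the adapter by passing its registered name (here, sql-lora) as the model.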