
# Benchmarking vLLM

This README guides you through running benchmark tests with the extensive datasets supported on vLLM. It's a living document, updated as new features and datasets become available.

## Dataset Overview

| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
| Random | ✅ | ✅ | `synthetic` |
| HuggingFace | ✅ | 🚧 | Specify your dataset path on HuggingFace |
| VisionArena | ✅ | 🚧 | `lmarena-ai/vision-arena-bench-v0.1` (a HuggingFace dataset) |

✅: supported 🚧: to be supported
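The downloadable datasets above can be fetched ahead of time. A minimal sketch (the target directory is illustrative, not required by the scripts):

```bash
# fetch the ShareGPT and BurstGPT datasets listed in the table;
# the directory name here is only an example
mkdir -p ~/vllm_benchmark_datasets && cd ~/vllm_benchmark_datasets
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv
```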

**Note**: VisionArena's `dataset-name` should be set to `hf`.


## Example - Online Benchmark

First start serving your model:

```bash
MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
vllm serve ${MODEL_NAME} --disable-log-requests
```
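Before running the benchmark, you can confirm the server is ready. A quick sanity check, assuming vLLM's default port 8000:

```bash
# list the models the OpenAI-compatible server exposes
curl http://localhost:8000/v1/models
```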

Then run the benchmarking script:

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="sharegpt"
DATASET_PATH="<your data path>/ShareGPT_V3_unfiltered_cleaned_split.json"

python3 benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint /v1/chat/completions \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}"
```
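By default the script dispatches all requests at once; to approximate steady traffic you can throttle them. A hedged variant using the script's `--request-rate` option (check `--help` on your version for the exact flags):

```bash
# same benchmark, but send roughly 2 requests per second
python3 benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint /v1/chat/completions \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --request-rate 2
```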

If successful, you will see output like the following:

```text
============ Serving Benchmark Result ============
Successful requests:                     10
Benchmark duration (s):                  5.78
Total input tokens:                      1369
Total generated tokens:                  2212
Request throughput (req/s):              1.73
Output token throughput (tok/s):         382.89
Total Token throughput (tok/s):          619.85
---------------Time to First Token----------------
Mean TTFT (ms):                          71.54
Median TTFT (ms):                        73.88
P99 TTFT (ms):                           79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.91
Median TPOT (ms):                        7.96
P99 TPOT (ms):                           8.03
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.74
Median ITL (ms):                         7.70
P99 ITL (ms):                            8.39
==================================================
```

### VisionArena Benchmark for Vision Language Models

```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```

```bash
MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
NUM_PROMPTS=10
BACKEND="openai-chat"
DATASET_NAME="hf"
DATASET_PATH="lmarena-ai/vision-arena-bench-v0.1"
DATASET_SPLIT='train'

python3 benchmarks/benchmark_serving.py \
  --backend "${BACKEND}" \
  --model "${MODEL_NAME}" \
  --endpoint "/v1/chat/completions" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --hf-split "${DATASET_SPLIT}" \
  --num-prompts "${NUM_PROMPTS}"
```

## Example - Offline Throughput Benchmark

```bash
MODEL_NAME="NousResearch/Hermes-3-Llama-3.1-8B"
NUM_PROMPTS=10
DATASET_NAME="sonnet"
DATASET_PATH="benchmarks/sonnet.txt"

python3 benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}"
```

If successful, you will see output like the following:

```text
Throughput: 7.35 requests/s, 4789.20 total tokens/s, 1102.83 output tokens/s
```
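To keep results for later comparison, the throughput script can also write its metrics to a file. A sketch assuming the script's `--output-json` option (verify with `--help` on your version):

```bash
# same run as above, persisting the metrics as JSON
python3 benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --dataset-name "${DATASET_NAME}" \
  --dataset-path "${DATASET_PATH}" \
  --num-prompts "${NUM_PROMPTS}" \
  --output-json throughput_results.json
```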

### Benchmark with LoRA Adapters

```bash
MODEL_NAME="meta-llama/Llama-2-7b-hf"
BACKEND="vllm"
DATASET_NAME="sharegpt"
DATASET_PATH="/home/jovyan/data/vllm_benchmark_datasets/ShareGPT_V3_unfiltered_cleaned_split.json"
NUM_PROMPTS=10
MAX_LORAS=2
MAX_LORA_RANK=8
ENABLE_LORA="--enable-lora"
LORA_PATH="yard1/llama-2-7b-sql-lora-test"

python3 benchmarks/benchmark_throughput.py \
  --model "${MODEL_NAME}" \
  --backend "${BACKEND}" \
  --dataset-path "${DATASET_PATH}" \
  --dataset-name "${DATASET_NAME}" \
  --num-prompts "${NUM_PROMPTS}" \
  --max-loras "${MAX_LORAS}" \
  --max-lora-rank "${MAX_LORA_RANK}" \
  ${ENABLE_LORA} \
  --lora-path "${LORA_PATH}"
```
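For the online counterpart, the same adapter can be attached at serve time. A minimal sketch using vLLM's `--enable-lora` and `--lora-modules` server options (the module name `sql-lora` is illustrative):

```bash
# serve the base model with the LoRA adapter registered as "sql-lora"
vllm serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules sql-lora=yard1/llama-2-7b-sql-lora-test
```

Requests that set `model` to `sql-lora` are then routed through the adapter.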