[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412)
This commit is contained in:
parent
05d686432f
commit
fbb74420e7
28
.buildkite/nightly-benchmarks/nightly-annotation.md
Normal file
28
.buildkite/nightly-benchmarks/nightly-annotation.md
Normal file
@ -0,0 +1,28 @@
|
|||||||
|
|
||||||
|
## Description
|
||||||
|
|
||||||
|
This file contains the downloading link for benchmarking results.
|
||||||
|
|
||||||
|
- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
|
||||||
|
- [benchmarking results](artifact://results.zip)
|
||||||
|
- [benchmarking code](artifact://nightly-benchmarks.zip)
|
||||||
|
|
||||||
|
Please download the visualization scripts in the post
|
||||||
|
|
||||||
|
|
||||||
|
## Results reproduction
|
||||||
|
|
||||||
|
- Find the docker we use in `benchmarking pipeline`
|
||||||
|
- Deploy the docker, and inside the docker:
|
||||||
|
- Download `nightly-benchmarks.zip`.
|
||||||
|
- In the same folder, run the following code
|
||||||
|
```
|
||||||
|
export HF_TOKEN=<your HF token>
|
||||||
|
apt update
|
||||||
|
apt install -y git
|
||||||
|
unzip nightly-benchmarks.zip
|
||||||
|
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
And the results will be inside `./benchmarks/results`.
|
||||||
|
|
@ -1,45 +1,39 @@
|
|||||||
|
|
||||||
# Nightly benchmark
|
# Nightly benchmark
|
||||||
|
|
||||||
The main goal of this benchmarking is two-fold:
|
This benchmark aims to:
|
||||||
- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
|
- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
|
||||||
- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().
|
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
|
||||||
|
|
||||||
|
Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
|
||||||
|
|
||||||
|
Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
|
||||||
|
|
||||||
|
|
||||||
## Docker images
|
## Setup
|
||||||
|
|
||||||
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
|
- Docker images:
|
||||||
- vllm/vllm-openai:v0.5.0.post1
|
- vLLM: `vllm/vllm-openai:v0.6.2`
|
||||||
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
|
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
|
||||||
- openmmlab/lmdeploy:v0.5.0
|
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
|
||||||
- ghcr.io/huggingface/text-generation-inference:2.1
|
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
|
||||||
|
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
|
||||||
|
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
|
||||||
|
- Hardware
|
||||||
|
- 8x Nvidia A100 GPUs
|
||||||
|
- Workload:
|
||||||
|
- Dataset
|
||||||
|
- ShareGPT dataset
|
||||||
|
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
|
||||||
|
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
|
||||||
|
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
|
||||||
|
- Models: llama-3 8B, llama-3 70B.
|
||||||
|
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
|
||||||
|
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
|
||||||
|
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
|
||||||
|
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
|
||||||
|
|
||||||
<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->
|
# Known issues
|
||||||
|
|
||||||
|
- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
|
||||||
## Hardware
|
- TGI does not support `ignore-eos` flag.
|
||||||
|
|
||||||
One AWS node with 8x NVIDIA A100 GPUs.
|
|
||||||
|
|
||||||
|
|
||||||
## Workload description
|
|
||||||
|
|
||||||
We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
|
|
||||||
|
|
||||||
- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
|
|
||||||
- Output length: the corresponding output length of these 500 prompts.
|
|
||||||
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
|
|
||||||
- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
|
|
||||||
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
|
|
||||||
|
|
||||||
<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
|
|
||||||
|
|
||||||
## Plots
|
|
||||||
|
|
||||||
In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
|
|
||||||
|
|
||||||
<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >
|
|
||||||
|
|
||||||
## Results
|
|
||||||
|
|
||||||
{nightly_results_benchmarking_table}
|
|
@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec
|
|||||||
|
|
||||||
common_container_settings: &common_container_settings
|
common_container_settings: &common_container_settings
|
||||||
command:
|
command:
|
||||||
- bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
|
- bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
|
||||||
resources:
|
resources:
|
||||||
limits:
|
limits:
|
||||||
nvidia.com/gpu: 8
|
nvidia.com/gpu: 8
|
||||||
@ -37,7 +37,10 @@ common_container_settings: &common_container_settings
|
|||||||
|
|
||||||
steps:
|
steps:
|
||||||
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
|
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
|
||||||
- label: "A100 trt benchmark"
|
|
||||||
|
|
||||||
|
|
||||||
|
- label: "A100 vllm step 10"
|
||||||
priority: 100
|
priority: 100
|
||||||
agents:
|
agents:
|
||||||
queue: A100
|
queue: A100
|
||||||
@ -46,7 +49,21 @@ steps:
|
|||||||
podSpec:
|
podSpec:
|
||||||
<<: *common_pod_spec
|
<<: *common_pod_spec
|
||||||
containers:
|
containers:
|
||||||
- image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
|
- image: vllm/vllm-openai:v0.6.2
|
||||||
|
<<: *common_container_settings
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
- label: "A100 sglang benchmark"
|
||||||
|
priority: 100
|
||||||
|
agents:
|
||||||
|
queue: A100
|
||||||
|
plugins:
|
||||||
|
- kubernetes:
|
||||||
|
podSpec:
|
||||||
|
<<: *common_pod_spec
|
||||||
|
containers:
|
||||||
|
- image: lmsysorg/sglang:v0.3.2-cu121
|
||||||
<<: *common_container_settings
|
<<: *common_container_settings
|
||||||
|
|
||||||
- label: "A100 lmdeploy benchmark"
|
- label: "A100 lmdeploy benchmark"
|
||||||
@ -58,11 +75,13 @@ steps:
|
|||||||
podSpec:
|
podSpec:
|
||||||
<<: *common_pod_spec
|
<<: *common_pod_spec
|
||||||
containers:
|
containers:
|
||||||
- image: openmmlab/lmdeploy:v0.5.0
|
- image: openmmlab/lmdeploy:v0.6.1-cu12
|
||||||
<<: *common_container_settings
|
<<: *common_container_settings
|
||||||
|
|
||||||
|
|
||||||
- label: "A100 vllm benchmark"
|
|
||||||
|
|
||||||
|
- label: "A100 trt llama-8B"
|
||||||
priority: 100
|
priority: 100
|
||||||
agents:
|
agents:
|
||||||
queue: A100
|
queue: A100
|
||||||
@ -71,10 +90,25 @@ steps:
|
|||||||
podSpec:
|
podSpec:
|
||||||
<<: *common_pod_spec
|
<<: *common_pod_spec
|
||||||
containers:
|
containers:
|
||||||
- image: vllm/vllm-openai:latest
|
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
|
||||||
<<: *common_container_settings
|
<<: *common_container_settings
|
||||||
|
env:
|
||||||
|
- name: VLLM_USAGE_SOURCE
|
||||||
|
value: ci-test
|
||||||
|
- name: HF_HOME
|
||||||
|
value: /root/.cache/huggingface
|
||||||
|
- name: VLLM_SOURCE_CODE_LOC
|
||||||
|
value: /workspace/build/buildkite/vllm/performance-benchmark
|
||||||
|
- name: HF_TOKEN
|
||||||
|
valueFrom:
|
||||||
|
secretKeyRef:
|
||||||
|
name: hf-token-secret
|
||||||
|
key: token
|
||||||
|
- name: TEST_SELECTOR
|
||||||
|
value: "llama8B"
|
||||||
|
|
||||||
- label: "A100 tgi benchmark"
|
|
||||||
|
- label: "A100 trt llama-70B"
|
||||||
priority: 100
|
priority: 100
|
||||||
agents:
|
agents:
|
||||||
queue: A100
|
queue: A100
|
||||||
@ -83,12 +117,54 @@ steps:
|
|||||||
podSpec:
|
podSpec:
|
||||||
<<: *common_pod_spec
|
<<: *common_pod_spec
|
||||||
containers:
|
containers:
|
||||||
- image: ghcr.io/huggingface/text-generation-inference:2.1
|
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
|
||||||
<<: *common_container_settings
|
<<: *common_container_settings
|
||||||
|
env:
|
||||||
|
- name: VLLM_USAGE_SOURCE
|
||||||
|
value: ci-test
|
||||||
|
- name: HF_HOME
|
||||||
|
value: /root/.cache/huggingface
|
||||||
|
- name: VLLM_SOURCE_CODE_LOC
|
||||||
|
value: /workspace/build/buildkite/vllm/performance-benchmark
|
||||||
|
- name: HF_TOKEN
|
||||||
|
valueFrom:
|
||||||
|
secretKeyRef:
|
||||||
|
name: hf-token-secret
|
||||||
|
key: token
|
||||||
|
- name: TEST_SELECTOR
|
||||||
|
value: "llama70B"
|
||||||
|
|
||||||
|
|
||||||
|
# FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
|
||||||
|
# - label: "A100 trt benchmark"
|
||||||
|
# priority: 100
|
||||||
|
# agents:
|
||||||
|
# queue: A100
|
||||||
|
# plugins:
|
||||||
|
# - kubernetes:
|
||||||
|
# podSpec:
|
||||||
|
# <<: *common_pod_spec
|
||||||
|
# containers:
|
||||||
|
# - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
|
||||||
|
# <<: *common_container_settings
|
||||||
|
|
||||||
|
|
||||||
|
# FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
|
||||||
|
# - label: "A100 tgi benchmark"
|
||||||
|
# priority: 100
|
||||||
|
# agents:
|
||||||
|
# queue: A100
|
||||||
|
# plugins:
|
||||||
|
# - kubernetes:
|
||||||
|
# podSpec:
|
||||||
|
# <<: *common_pod_spec
|
||||||
|
# containers:
|
||||||
|
# - image: ghcr.io/huggingface/text-generation-inference:2.2.0
|
||||||
|
# <<: *common_container_settings
|
||||||
|
|
||||||
- wait
|
- wait
|
||||||
|
|
||||||
- label: "Plot"
|
- label: "Collect the results"
|
||||||
priority: 100
|
priority: 100
|
||||||
agents:
|
agents:
|
||||||
queue: A100
|
queue: A100
|
||||||
@ -117,4 +193,4 @@ steps:
|
|||||||
name: hf-token-secret
|
name: hf-token-secret
|
||||||
key: token
|
key: token
|
||||||
|
|
||||||
- wait
|
- block: ":rocket: check the results!"
|
@ -1,76 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
|
|
||||||
set -o pipefail
|
|
||||||
set -x
|
|
||||||
|
|
||||||
check_gpus() {
|
|
||||||
# check the number of GPUs and GPU type.
|
|
||||||
declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
|
|
||||||
if [[ $gpu_count -gt 0 ]]; then
|
|
||||||
echo "GPU found."
|
|
||||||
else
|
|
||||||
echo "Need at least 1 GPU to run benchmarking."
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
|
|
||||||
echo "GPU type is $gpu_type"
|
|
||||||
}
|
|
||||||
|
|
||||||
check_hf_token() {
|
|
||||||
# check if HF_TOKEN is available and valid
|
|
||||||
if [[ -z "$HF_TOKEN" ]]; then
|
|
||||||
echo "Error: HF_TOKEN is not set."
|
|
||||||
exit 1
|
|
||||||
elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then
|
|
||||||
echo "Error: HF_TOKEN does not start with 'hf_'."
|
|
||||||
exit 1
|
|
||||||
else
|
|
||||||
echo "HF_TOKEN is set and valid."
|
|
||||||
fi
|
|
||||||
}
|
|
||||||
|
|
||||||
main() {
|
|
||||||
|
|
||||||
check_gpus
|
|
||||||
check_hf_token
|
|
||||||
|
|
||||||
df -h
|
|
||||||
|
|
||||||
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
|
|
||||||
(which jq) || (apt-get update && apt-get -y install jq)
|
|
||||||
|
|
||||||
cd $VLLM_SOURCE_CODE_LOC/benchmarks
|
|
||||||
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
|
|
||||||
|
|
||||||
|
|
||||||
# run lmdeploy
|
|
||||||
if which lmdeploy >/dev/null; then
|
|
||||||
echo "lmdeploy is available, redirect to run-lmdeploy-nightly.sh"
|
|
||||||
bash ../.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
# run tgi
|
|
||||||
if [ -e /tgi-entrypoint.sh ]; then
|
|
||||||
echo "tgi is available, redirect to run-tgi-nightly.sh"
|
|
||||||
bash ../.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
# run trt
|
|
||||||
if which trtllm-build >/dev/null; then
|
|
||||||
echo "trtllm is available, redirect to run-trt-nightly.sh"
|
|
||||||
bash ../.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
# run vllm
|
|
||||||
if [ -e /vllm-workspace ]; then
|
|
||||||
echo "vllm is available, redirect to run-vllm-nightly.sh"
|
|
||||||
bash ../.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh
|
|
||||||
exit 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
}
|
|
||||||
|
|
||||||
main "$@"
|
|
@ -0,0 +1,95 @@
|
|||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
from tabulate import tabulate
|
||||||
|
|
||||||
|
|
||||||
|
def parse_arguments():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description=
|
||||||
|
'Parse command line arguments for summary-nightly-results script.')
|
||||||
|
parser.add_argument('--results-folder',
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help='The folder where the results are stored.')
|
||||||
|
parser.add_argument('--description',
|
||||||
|
type=str,
|
||||||
|
required=True,
|
||||||
|
help='Description of the results.')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
return args
|
||||||
|
|
||||||
|
|
||||||
|
def get_perf(df, method, model, metric):
|
||||||
|
|
||||||
|
means = []
|
||||||
|
|
||||||
|
for qps in [2, 4, 8, 16, "inf"]:
|
||||||
|
target = df['Test name'].str.contains(model)
|
||||||
|
target = target & df['Engine'].str.contains(method)
|
||||||
|
target = target & df['Test name'].str.contains("qps_" + str(qps))
|
||||||
|
filtered_df = df[target]
|
||||||
|
|
||||||
|
if filtered_df.empty:
|
||||||
|
means.append(0.)
|
||||||
|
else:
|
||||||
|
means.append(filtered_df[metric].values[0])
|
||||||
|
|
||||||
|
return np.array(means)
|
||||||
|
|
||||||
|
|
||||||
|
def get_perf_w_std(df, method, model, metric):
|
||||||
|
|
||||||
|
if metric in ["TTFT", "ITL"]:
|
||||||
|
mean = get_perf(df, method, model, "Mean " + metric + " (ms)")
|
||||||
|
mean = mean.tolist()
|
||||||
|
std = get_perf(df, method, model, "Std " + metric + " (ms)")
|
||||||
|
if std.mean() == 0:
|
||||||
|
std = None
|
||||||
|
success = get_perf(df, method, model, "Successful req.")
|
||||||
|
if std is not None:
|
||||||
|
std = std / np.sqrt(success)
|
||||||
|
std = std.tolist()
|
||||||
|
|
||||||
|
else:
|
||||||
|
assert metric == "Tput"
|
||||||
|
mean = get_perf(df, method, model, "Input Tput (tok/s)") + get_perf(
|
||||||
|
df, method, model, "Output Tput (tok/s)")
|
||||||
|
mean = mean.tolist()
|
||||||
|
std = None
|
||||||
|
|
||||||
|
return mean, std
|
||||||
|
|
||||||
|
|
||||||
|
def main(args):
|
||||||
|
results_folder = Path(args.results_folder)
|
||||||
|
|
||||||
|
results = []
|
||||||
|
|
||||||
|
# collect results
|
||||||
|
for test_file in results_folder.glob("*_nightly_results.json"):
|
||||||
|
with open(test_file, "r") as f:
|
||||||
|
results = results + json.loads(f.read())
|
||||||
|
|
||||||
|
# generate markdown table
|
||||||
|
df = pd.DataFrame.from_dict(results)
|
||||||
|
|
||||||
|
md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False)
|
||||||
|
|
||||||
|
with open(args.description, "r") as f:
|
||||||
|
description = f.read()
|
||||||
|
|
||||||
|
description = description.format(
|
||||||
|
nightly_results_benchmarking_table=md_table)
|
||||||
|
|
||||||
|
with open("nightly_results.md", "w") as f:
|
||||||
|
f.write(description)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
args = parse_arguments()
|
||||||
|
main(args)
|
241
.buildkite/nightly-benchmarks/scripts/launch-server.sh
Normal file
241
.buildkite/nightly-benchmarks/scripts/launch-server.sh
Normal file
@ -0,0 +1,241 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Currently FP8 benchmark is NOT enabled.
|
||||||
|
|
||||||
|
set -x
|
||||||
|
server_params=$1
|
||||||
|
common_params=$2
|
||||||
|
|
||||||
|
json2args() {
|
||||||
|
# transforms the JSON string to command line args, and '_' is replaced to '-'
|
||||||
|
# example:
|
||||||
|
# input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
|
||||||
|
# output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
|
||||||
|
local json_string=$1
|
||||||
|
local args=$(
|
||||||
|
echo "$json_string" | jq -r '
|
||||||
|
to_entries |
|
||||||
|
map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
|
||||||
|
join(" ")
|
||||||
|
'
|
||||||
|
)
|
||||||
|
echo "$args"
|
||||||
|
}
|
||||||
|
|
||||||
|
launch_trt_server() {
|
||||||
|
|
||||||
|
model_path=$(echo "$common_params" | jq -r '.model')
|
||||||
|
model_name="${model_path#*/}"
|
||||||
|
model_type=$(echo "$server_params" | jq -r '.model_type')
|
||||||
|
model_dtype=$(echo "$server_params" | jq -r '.model_dtype')
|
||||||
|
model_tp_size=$(echo "$common_params" | jq -r '.tp')
|
||||||
|
max_batch_size=$(echo "$server_params" | jq -r '.max_batch_size')
|
||||||
|
max_input_len=$(echo "$server_params" | jq -r '.max_input_len')
|
||||||
|
max_seq_len=$(echo "$server_params" | jq -r '.max_seq_len')
|
||||||
|
max_num_tokens=$(echo "$server_params" | jq -r '.max_num_tokens')
|
||||||
|
trt_llm_version=$(echo "$server_params" | jq -r '.trt_llm_version')
|
||||||
|
|
||||||
|
# create model caching directory
|
||||||
|
cd ~
|
||||||
|
rm -rf models
|
||||||
|
mkdir -p models
|
||||||
|
cd models
|
||||||
|
models_dir=$(pwd)
|
||||||
|
trt_model_path=${models_dir}/${model_name}-trt-ckpt
|
||||||
|
trt_engine_path=${models_dir}/${model_name}-trt-engine
|
||||||
|
|
||||||
|
# clone tensorrt backend
|
||||||
|
cd /
|
||||||
|
rm -rf tensorrtllm_backend
|
||||||
|
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
|
||||||
|
git lfs install
|
||||||
|
cd tensorrtllm_backend
|
||||||
|
git checkout $trt_llm_version
|
||||||
|
tensorrtllm_backend_dir=$(pwd)
|
||||||
|
git submodule update --init --recursive
|
||||||
|
|
||||||
|
# build trtllm engine
|
||||||
|
cd /tensorrtllm_backend
|
||||||
|
cd ./tensorrt_llm/examples/${model_type}
|
||||||
|
python3 convert_checkpoint.py \
|
||||||
|
--model_dir ${model_path} \
|
||||||
|
--dtype ${model_dtype} \
|
||||||
|
--tp_size ${model_tp_size} \
|
||||||
|
--output_dir ${trt_model_path}
|
||||||
|
trtllm-build \
|
||||||
|
--checkpoint_dir ${trt_model_path} \
|
||||||
|
--use_fused_mlp \
|
||||||
|
--reduce_fusion disable \
|
||||||
|
--workers 8 \
|
||||||
|
--gpt_attention_plugin ${model_dtype} \
|
||||||
|
--gemm_plugin ${model_dtype} \
|
||||||
|
--tp_size ${model_tp_size} \
|
||||||
|
--max_batch_size ${max_batch_size} \
|
||||||
|
--max_input_len ${max_input_len} \
|
||||||
|
--max_seq_len ${max_seq_len} \
|
||||||
|
--max_num_tokens ${max_num_tokens} \
|
||||||
|
--output_dir ${trt_engine_path}
|
||||||
|
|
||||||
|
# handle triton protobuf files and launch triton server
|
||||||
|
cd /tensorrtllm_backend
|
||||||
|
mkdir triton_model_repo
|
||||||
|
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
|
||||||
|
cd triton_model_repo
|
||||||
|
rm -rf ./tensorrt_llm/1/*
|
||||||
|
cp -r ${trt_engine_path}/* ./tensorrt_llm/1
|
||||||
|
python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
|
||||||
|
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5
|
||||||
|
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false
|
||||||
|
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size
|
||||||
|
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1
|
||||||
|
cd /tensorrtllm_backend
|
||||||
|
python3 scripts/launch_triton_server.py \
|
||||||
|
--world_size=${model_tp_size} \
|
||||||
|
--model_repo=/tensorrtllm_backend/triton_model_repo &
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
launch_tgi_server() {
|
||||||
|
model=$(echo "$common_params" | jq -r '.model')
|
||||||
|
tp=$(echo "$common_params" | jq -r '.tp')
|
||||||
|
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
|
||||||
|
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
|
||||||
|
port=$(echo "$common_params" | jq -r '.port')
|
||||||
|
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
|
||||||
|
server_args=$(json2args "$server_params")
|
||||||
|
|
||||||
|
if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
|
||||||
|
echo "Key 'fp8' exists in common params."
|
||||||
|
server_command="/tgi-entrypoint.sh \
|
||||||
|
--model-id $model \
|
||||||
|
--num-shard $tp \
|
||||||
|
--port $port \
|
||||||
|
--quantize fp8 \
|
||||||
|
$server_args"
|
||||||
|
else
|
||||||
|
echo "Key 'fp8' does not exist in common params."
|
||||||
|
server_command="/tgi-entrypoint.sh \
|
||||||
|
--model-id $model \
|
||||||
|
--num-shard $tp \
|
||||||
|
--port $port \
|
||||||
|
$server_args"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Server command: $server_command"
|
||||||
|
eval "$server_command" &
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
launch_lmdeploy_server() {
|
||||||
|
model=$(echo "$common_params" | jq -r '.model')
|
||||||
|
tp=$(echo "$common_params" | jq -r '.tp')
|
||||||
|
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
|
||||||
|
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
|
||||||
|
port=$(echo "$common_params" | jq -r '.port')
|
||||||
|
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
|
||||||
|
server_args=$(json2args "$server_params")
|
||||||
|
|
||||||
|
server_command="lmdeploy serve api_server $model \
|
||||||
|
--tp $tp \
|
||||||
|
--server-port $port \
|
||||||
|
$server_args"
|
||||||
|
|
||||||
|
# run the server
|
||||||
|
echo "Server command: $server_command"
|
||||||
|
bash -c "$server_command" &
|
||||||
|
}
|
||||||
|
|
||||||
|
launch_sglang_server() {
|
||||||
|
|
||||||
|
model=$(echo "$common_params" | jq -r '.model')
|
||||||
|
tp=$(echo "$common_params" | jq -r '.tp')
|
||||||
|
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
|
||||||
|
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
|
||||||
|
port=$(echo "$common_params" | jq -r '.port')
|
||||||
|
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
|
||||||
|
server_args=$(json2args "$server_params")
|
||||||
|
|
||||||
|
if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
|
||||||
|
echo "Key 'fp8' exists in common params. Use neuralmagic fp8 model for convenience."
|
||||||
|
model=$(echo "$common_params" | jq -r '.neuralmagic_quantized_model')
|
||||||
|
server_command="python3 \
|
||||||
|
-m sglang.launch_server \
|
||||||
|
--tp $tp \
|
||||||
|
--model-path $model \
|
||||||
|
--port $port \
|
||||||
|
$server_args"
|
||||||
|
else
|
||||||
|
echo "Key 'fp8' does not exist in common params."
|
||||||
|
server_command="python3 \
|
||||||
|
-m sglang.launch_server \
|
||||||
|
--tp $tp \
|
||||||
|
--model-path $model \
|
||||||
|
--port $port \
|
||||||
|
$server_args"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# run the server
|
||||||
|
echo "Server command: $server_command"
|
||||||
|
eval "$server_command" &
|
||||||
|
}
|
||||||
|
|
||||||
|
launch_vllm_server() {
|
||||||
|
|
||||||
|
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
|
||||||
|
|
||||||
|
model=$(echo "$common_params" | jq -r '.model')
|
||||||
|
tp=$(echo "$common_params" | jq -r '.tp')
|
||||||
|
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
|
||||||
|
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
|
||||||
|
port=$(echo "$common_params" | jq -r '.port')
|
||||||
|
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
|
||||||
|
server_args=$(json2args "$server_params")
|
||||||
|
|
||||||
|
if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
|
||||||
|
echo "Key 'fp8' exists in common params. Use neuralmagic fp8 model for convenience."
|
||||||
|
model=$(echo "$common_params" | jq -r '.neuralmagic_quantized_model')
|
||||||
|
server_command="python3 \
|
||||||
|
-m vllm.entrypoints.openai.api_server \
|
||||||
|
-tp $tp \
|
||||||
|
--model $model \
|
||||||
|
--port $port \
|
||||||
|
$server_args"
|
||||||
|
else
|
||||||
|
echo "Key 'fp8' does not exist in common params."
|
||||||
|
server_command="python3 \
|
||||||
|
-m vllm.entrypoints.openai.api_server \
|
||||||
|
-tp $tp \
|
||||||
|
--model $model \
|
||||||
|
--port $port \
|
||||||
|
$server_args"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# run the server
|
||||||
|
echo "Server command: $server_command"
|
||||||
|
eval "$server_command" &
|
||||||
|
}
|
||||||
|
|
||||||
|
main() {
|
||||||
|
|
||||||
|
if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then
|
||||||
|
launch_trt_server
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then
|
||||||
|
launch_tgi_server
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then
|
||||||
|
launch_lmdeploy_server
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then
|
||||||
|
launch_sglang_server
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$CURRENT_LLM_SERVING_ENGINE" == *"vllm"* ]]; then
|
||||||
|
launch_vllm_server
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
main
|
@ -1,102 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
|
|
||||||
|
|
||||||
server_params=$1
|
|
||||||
common_params=$2
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
model_path=$(echo "$common_params" | jq -r '.model')
|
|
||||||
model_name="${model_path#*/}"
|
|
||||||
model_type=$(echo "$server_params" | jq -r '.model_type')
|
|
||||||
model_dtype=$(echo "$server_params" | jq -r '.model_dtype')
|
|
||||||
model_tp_size=$(echo "$common_params" | jq -r '.tp')
|
|
||||||
max_batch_size=$(echo "$server_params" | jq -r '.max_batch_size')
|
|
||||||
max_input_len=$(echo "$server_params" | jq -r '.max_input_len')
|
|
||||||
max_output_len=$(echo "$server_params" | jq -r '.max_output_len')
|
|
||||||
trt_llm_version=$(echo "$server_params" | jq -r '.trt_llm_version')
|
|
||||||
|
|
||||||
cd ~
|
|
||||||
rm -rf models
|
|
||||||
mkdir -p models
|
|
||||||
cd models
|
|
||||||
models_dir=$(pwd)
|
|
||||||
trt_model_path=${models_dir}/${model_name}-trt-ckpt
|
|
||||||
trt_engine_path=${models_dir}/${model_name}-trt-engine
|
|
||||||
|
|
||||||
cd ~
|
|
||||||
rm -rf tensorrt-demo
|
|
||||||
git clone https://github.com/neuralmagic/tensorrt-demo.git
|
|
||||||
cd tensorrt-demo
|
|
||||||
tensorrt_demo_dir=$(pwd)
|
|
||||||
|
|
||||||
# make sure the parameter inside tensorrt_demo is consistent to envvar
|
|
||||||
sed -i.bak "/key: \"tokenizer_dir\"/,/string_value:/s|string_value: \".*\"|string_value: \"$model_path\"|" ./triton_model_repo/postprocessing/config.pbtxt
|
|
||||||
sed -i.bak "/key: \"tokenizer_dir\"/,/string_value:/s|string_value: \".*\"|string_value: \"$model_path\"|" ./triton_model_repo/preprocessing/config.pbtxt
|
|
||||||
sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/ensemble/config.pbtxt
|
|
||||||
sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/preprocessing/config.pbtxt
|
|
||||||
sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/postprocessing/config.pbtxt
|
|
||||||
sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/tensorrt_llm_bls/config.pbtxt
|
|
||||||
|
|
||||||
|
|
||||||
cd /
|
|
||||||
rm -rf tensorrtllm_backend
|
|
||||||
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
|
|
||||||
git lfs install
|
|
||||||
cd tensorrtllm_backend
|
|
||||||
git checkout $trt_llm_version
|
|
||||||
tensorrtllm_backend_dir=$(pwd)
|
|
||||||
git submodule update --init --recursive
|
|
||||||
cp -r ${tensorrt_demo_dir}/triton_model_repo ${tensorrtllm_backend_dir}/
|
|
||||||
|
|
||||||
cd /tensorrtllm_backend
|
|
||||||
cd ./tensorrt_llm/examples/${model_type}
|
|
||||||
|
|
||||||
|
|
||||||
if echo "$common_params" | jq -e 'has("fp8")' > /dev/null; then
|
|
||||||
|
|
||||||
echo "Key 'fp8' exists in common params. Use quantize.py instead of convert_checkpoint.py"
|
|
||||||
echo "Reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md"
|
|
||||||
python ../quantization/quantize.py \
|
|
||||||
--model_dir ${model_path} \
|
|
||||||
--dtype ${model_dtype} \
|
|
||||||
--tp_size ${model_tp_size} \
|
|
||||||
--output_dir ${trt_model_path} \
|
|
||||||
--qformat fp8 \
|
|
||||||
--kv_cache_dtype fp8 \
|
|
||||||
--calib_size 2
|
|
||||||
|
|
||||||
else
|
|
||||||
|
|
||||||
echo "Key 'fp8' does not exist in common params. Use convert_checkpoint.py"
|
|
||||||
python3 convert_checkpoint.py \
|
|
||||||
--model_dir ${model_path} \
|
|
||||||
--dtype ${model_dtype} \
|
|
||||||
--tp_size ${model_tp_size} \
|
|
||||||
--output_dir ${trt_model_path}
|
|
||||||
|
|
||||||
fi
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
trtllm-build \
|
|
||||||
--checkpoint_dir=${trt_model_path} \
|
|
||||||
--gpt_attention_plugin=${model_dtype} \
|
|
||||||
--gemm_plugin=${model_dtype} \
|
|
||||||
--remove_input_padding=enable \
|
|
||||||
--paged_kv_cache=enable \
|
|
||||||
--tp_size=${model_tp_size} \
|
|
||||||
--max_batch_size=${max_batch_size} \
|
|
||||||
--max_input_len=${max_input_len} \
|
|
||||||
--max_output_len=${max_output_len} \
|
|
||||||
--max_num_tokens=${max_output_len} \
|
|
||||||
--opt_num_tokens=${max_output_len} \
|
|
||||||
--output_dir=${trt_engine_path}
|
|
||||||
|
|
||||||
cd /tensorrtllm_backend/triton_model_repo
|
|
||||||
rm -rf ./tensorrt_llm/1/*
|
|
||||||
cp -r ${trt_engine_path}/* ./tensorrt_llm/1
|
|
||||||
cd /tensorrtllm_backend
|
|
||||||
python3 scripts/launch_triton_server.py \
|
|
||||||
--world_size=${model_tp_size} \
|
|
||||||
--model_repo=/tensorrtllm_backend/triton_model_repo &
|
|
@ -8,6 +8,7 @@ main() {
|
|||||||
|
|
||||||
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
|
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
|
||||||
(which jq) || (apt-get update && apt-get -y install jq)
|
(which jq) || (apt-get update && apt-get -y install jq)
|
||||||
|
(which zip) || (apt-get install -y zip)
|
||||||
|
|
||||||
if [ ! -f /workspace/buildkite-agent ]; then
|
if [ ! -f /workspace/buildkite-agent ]; then
|
||||||
echo "buildkite-agent binary not found. Skip plotting the results."
|
echo "buildkite-agent binary not found. Skip plotting the results."
|
||||||
@ -24,17 +25,54 @@ main() {
|
|||||||
ls
|
ls
|
||||||
ls results/
|
ls results/
|
||||||
|
|
||||||
# generate figures
|
# upload benchmark results
|
||||||
python3 -m pip install tabulate pandas matplotlib
|
zip -r results.zip results/
|
||||||
python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
|
/workspace/buildkite-agent artifact upload "results.zip"
|
||||||
--description $description \
|
|
||||||
--results-folder results/
|
|
||||||
|
|
||||||
# upload results and figures
|
# upload benchmarking scripts
|
||||||
/workspace/buildkite-agent artifact upload "nightly_results.png"
|
cd $VLLM_SOURCE_CODE_LOC/
|
||||||
/workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
|
zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
|
||||||
/workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/tests/nightly-tests.json
|
/workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"
|
||||||
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
|
|
||||||
|
cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
|
||||||
|
# upload benchmarking pipeline
|
||||||
|
/workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"
|
||||||
|
|
||||||
|
cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
|
||||||
|
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# The figures should be genereated by a separate process outside the CI/CD pipeline
|
||||||
|
|
||||||
|
# # generate figures
|
||||||
|
# python3 -m pip install tabulate pandas matplotlib
|
||||||
|
|
||||||
|
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py \
|
||||||
|
# --description $description \
|
||||||
|
# --results-folder results/
|
||||||
|
|
||||||
|
|
||||||
|
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
|
||||||
|
# --description $description \
|
||||||
|
# --results-folder results/ \
|
||||||
|
# --dataset sharegpt
|
||||||
|
|
||||||
|
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
|
||||||
|
# --description $description \
|
||||||
|
# --results-folder results/ \
|
||||||
|
# --dataset sonnet_2048_128
|
||||||
|
|
||||||
|
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
|
||||||
|
# --description $description \
|
||||||
|
# --results-folder results/ \
|
||||||
|
# --dataset sonnet_128_2048
|
||||||
|
|
||||||
|
# # upload results and figures
|
||||||
|
# /workspace/buildkite-agent artifact upload "nightly_results*.png"
|
||||||
|
# /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
|
||||||
|
# /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/tests/nightly-tests.json
|
||||||
|
# /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
|
||||||
}
|
}
|
||||||
|
|
||||||
main "$@"
|
main "$@"
|
@ -1,135 +0,0 @@
|
|||||||
import argparse
|
|
||||||
import json
|
|
||||||
import math
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
import matplotlib.pyplot as plt
|
|
||||||
import pandas as pd
|
|
||||||
from tabulate import tabulate
|
|
||||||
|
|
||||||
|
|
||||||
def parse_arguments():
|
|
||||||
parser = argparse.ArgumentParser(
|
|
||||||
description=
|
|
||||||
'Parse command line arguments for summary-nightly-results script.')
|
|
||||||
parser.add_argument('--results-folder',
|
|
||||||
type=str,
|
|
||||||
required=True,
|
|
||||||
help='The folder where the results are stored.')
|
|
||||||
parser.add_argument('--description',
|
|
||||||
type=str,
|
|
||||||
required=True,
|
|
||||||
help='Description of the results.')
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
return args
|
|
||||||
|
|
||||||
|
|
||||||
def main(args):
|
|
||||||
bar_colors = ['#56B4E9', '#009E73', '#D55E00', '#E69F00']
|
|
||||||
results_folder = Path(args.results_folder)
|
|
||||||
|
|
||||||
results = []
|
|
||||||
|
|
||||||
# collect results
|
|
||||||
for test_file in results_folder.glob("*_nightly_results.json"):
|
|
||||||
with open(test_file, "r") as f:
|
|
||||||
results = results + json.loads(f.read())
|
|
||||||
|
|
||||||
# generate markdown table
|
|
||||||
df = pd.DataFrame.from_dict(results)
|
|
||||||
|
|
||||||
md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False)
|
|
||||||
|
|
||||||
with open(args.description, "r") as f:
|
|
||||||
description = f.read()
|
|
||||||
|
|
||||||
description = description.format(
|
|
||||||
nightly_results_benchmarking_table=md_table)
|
|
||||||
|
|
||||||
with open("nightly_results.md", "w") as f:
|
|
||||||
f.write(description)
|
|
||||||
|
|
||||||
plt.rcParams.update({'font.size': 20})
|
|
||||||
|
|
||||||
# plot results
|
|
||||||
fig, axes = plt.subplots(3, 3, figsize=(16, 14))
|
|
||||||
fig.subplots_adjust(hspace=1)
|
|
||||||
methods = ["vllm", "trt", "lmdeploy", "tgi"]
|
|
||||||
for i, model in enumerate(["llama8B", "llama70B", "mixtral8x7B"]):
|
|
||||||
for j, metric in enumerate(["TTFT", "ITL"]):
|
|
||||||
means, stds = [], []
|
|
||||||
for method in methods:
|
|
||||||
target = df['Test name'].str.contains(model)
|
|
||||||
target = target & df['Engine'].str.contains(method)
|
|
||||||
filtered_df = df[target]
|
|
||||||
|
|
||||||
if filtered_df.empty:
|
|
||||||
means.append(0.)
|
|
||||||
stds.append(0.)
|
|
||||||
else:
|
|
||||||
means.append(filtered_df[f"Mean {metric} (ms)"].values[0])
|
|
||||||
std = filtered_df[f"Std {metric} (ms)"].values[0]
|
|
||||||
success = filtered_df["Successful req."].values[0]
|
|
||||||
stds.append(std / math.sqrt(success))
|
|
||||||
|
|
||||||
print(model, metric)
|
|
||||||
print(means, stds)
|
|
||||||
|
|
||||||
ax = axes[i, j + 1]
|
|
||||||
|
|
||||||
bars = ax.bar(
|
|
||||||
["vllm", "trt", "lmdeploy", "tgi"],
|
|
||||||
means,
|
|
||||||
yerr=stds,
|
|
||||||
capsize=10,
|
|
||||||
)
|
|
||||||
for idx, bar in enumerate(bars):
|
|
||||||
bar.set_color(bar_colors[idx])
|
|
||||||
ax.set_ylim(bottom=0)
|
|
||||||
|
|
||||||
ax.set_ylabel(f"{metric} (ms)")
|
|
||||||
ax.set_title(f"{model} {metric}")
|
|
||||||
ax.grid(axis='y')
|
|
||||||
|
|
||||||
metric = "Tput"
|
|
||||||
j = 0
|
|
||||||
if True:
|
|
||||||
tputs = []
|
|
||||||
for method in methods:
|
|
||||||
target = df['Test name'].str.contains(model)
|
|
||||||
target = target & df['Engine'].str.contains(method)
|
|
||||||
filtered_df = df[target]
|
|
||||||
|
|
||||||
if filtered_df.empty:
|
|
||||||
tputs.append(0.)
|
|
||||||
else:
|
|
||||||
input_tput = filtered_df["Input Tput (tok/s)"].values[0]
|
|
||||||
output_tput = filtered_df["Output Tput (tok/s)"].values[0]
|
|
||||||
tputs.append(input_tput + output_tput)
|
|
||||||
|
|
||||||
print(model, metric)
|
|
||||||
print(tputs)
|
|
||||||
|
|
||||||
ax = axes[i, j]
|
|
||||||
|
|
||||||
bars = ax.bar(
|
|
||||||
["vllm", "trt", "lmdeploy", "tgi"],
|
|
||||||
tputs,
|
|
||||||
)
|
|
||||||
for idx, bar in enumerate(bars):
|
|
||||||
bar.set_color(bar_colors[idx])
|
|
||||||
|
|
||||||
ax.set_ylim(bottom=0)
|
|
||||||
|
|
||||||
ax.set_ylabel("Tput (token/s)")
|
|
||||||
ax.set_title(f"{model} {metric}")
|
|
||||||
ax.grid(axis='y')
|
|
||||||
|
|
||||||
fig.tight_layout()
|
|
||||||
fig.savefig("nightly_results.png", bbox_inches='tight', dpi=400)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
args = parse_arguments()
|
|
||||||
main(args)
|
|
@ -1,218 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
check_gpus() {
|
|
||||||
# check the number of GPUs and GPU type.
|
|
||||||
declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
|
|
||||||
if [[ $gpu_count -gt 0 ]]; then
|
|
||||||
echo "GPU found."
|
|
||||||
else
|
|
||||||
echo "Need at least 1 GPU to run benchmarking."
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
|
|
||||||
echo "GPU type is $gpu_type"
|
|
||||||
}
|
|
||||||
|
|
||||||
kill_gpu_processes() {
|
|
||||||
pkill lmdeploy || true
|
|
||||||
# waiting for GPU processes to be fully killed
|
|
||||||
sleep 10
|
|
||||||
# Print the GPU memory usage
|
|
||||||
# so that we know if all GPU processes are killed.
|
|
||||||
gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
|
|
||||||
# The memory usage should be 0 MB.
|
|
||||||
echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
|
|
||||||
}
|
|
||||||
|
|
||||||
json2args() {
|
|
||||||
# transforms the JSON string to command line args, and '_' is replaced to '-'
|
|
||||||
# example:
|
|
||||||
# input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
|
|
||||||
# output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
|
|
||||||
local json_string=$1
|
|
||||||
local args=$(
|
|
||||||
echo "$json_string" | jq -r '
|
|
||||||
to_entries |
|
|
||||||
map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
|
|
||||||
join(" ")
|
|
||||||
'
|
|
||||||
)
|
|
||||||
echo "$args"
|
|
||||||
}
|
|
||||||
|
|
||||||
wait_for_server() {
|
|
||||||
# wait for vllm server to start
|
|
||||||
# return 1 if vllm server crashes
|
|
||||||
timeout 1200 bash -c '
|
|
||||||
until curl -s localhost:8000/v1/completions > /dev/null; do
|
|
||||||
sleep 1
|
|
||||||
done' && return 0 || return 1
|
|
||||||
}
|
|
||||||
|
|
||||||
run_serving_tests() {
|
|
||||||
# run serving tests using `benchmark_serving.py`
|
|
||||||
# $1: a json file specifying serving test cases
|
|
||||||
|
|
||||||
local serving_test_file
|
|
||||||
serving_test_file=$1
|
|
||||||
|
|
||||||
# Iterate over serving tests
|
|
||||||
jq -c '.[]' "$serving_test_file" | while read -r params; do
|
|
||||||
# get the test name, and append the GPU type back to it.
|
|
||||||
test_name=$(echo "$params" | jq -r '.test_name')
|
|
||||||
|
|
||||||
# if TEST_SELECTOR is set, only run the test cases that match the selector
|
|
||||||
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
|
|
||||||
echo "Skip test case $test_name."
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
|
|
||||||
# append lmdeploy to the test name
|
|
||||||
test_name=lmdeploy_$test_name
|
|
||||||
|
|
||||||
# get common parameters
|
|
||||||
common_params=$(echo "$params" | jq -r '.common_parameters')
|
|
||||||
model=$(echo "$common_params" | jq -r '.model')
|
|
||||||
tp=$(echo "$common_params" | jq -r '.tp')
|
|
||||||
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
|
|
||||||
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
|
|
||||||
port=$(echo "$common_params" | jq -r '.port')
|
|
||||||
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
# get client and server arguments
|
|
||||||
server_params=$(echo "$params" | jq -r '.lmdeploy_server_parameters')
|
|
||||||
client_params=$(echo "$params" | jq -r '.lmdeploy_client_parameters')
|
|
||||||
server_args=$(json2args "$server_params")
|
|
||||||
client_args=$(json2args "$client_params")
|
|
||||||
qps_list=$(echo "$params" | jq -r '.qps_list')
|
|
||||||
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
|
|
||||||
echo "Running over qps list $qps_list"
|
|
||||||
|
|
||||||
# check if there is enough GPU to run the test
|
|
||||||
if [[ $gpu_count -lt $tp ]]; then
|
|
||||||
echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
|
|
||||||
# prepare tokenizer
|
|
||||||
rm -rf /tokenizer_cache
|
|
||||||
mkdir /tokenizer_cache
|
|
||||||
python ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
|
|
||||||
--model "$model" \
|
|
||||||
--cachedir /tokenizer_cache
|
|
||||||
|
|
||||||
server_command="lmdeploy serve api_server $model \
|
|
||||||
--tp $tp \
|
|
||||||
--server-port $port \
|
|
||||||
$server_args"
|
|
||||||
|
|
||||||
# run the server
|
|
||||||
echo "Running test case $test_name"
|
|
||||||
echo "Server command: $server_command"
|
|
||||||
bash -c "$server_command" &
|
|
||||||
|
|
||||||
# wait until the server is alive
|
|
||||||
wait_for_server
|
|
||||||
if [ $? -eq 0 ]; then
|
|
||||||
echo ""
|
|
||||||
echo "lmdeploy server is up and running."
|
|
||||||
else
|
|
||||||
echo ""
|
|
||||||
echo "lmdeploy failed to start within the timeout period."
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
|
|
||||||
# get model name
|
|
||||||
model_name=$(python ../.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py)
|
|
||||||
|
|
||||||
# iterate over different QPS
|
|
||||||
for qps in $qps_list; do
|
|
||||||
# remove the surrounding single quote from qps
|
|
||||||
if [[ "$qps" == *"inf"* ]]; then
|
|
||||||
echo "qps was $qps"
|
|
||||||
qps="inf"
|
|
||||||
echo "now qps is $qps"
|
|
||||||
fi
|
|
||||||
|
|
||||||
new_test_name=$test_name"_qps_"$qps
|
|
||||||
|
|
||||||
client_command="python3 benchmark_serving.py \
|
|
||||||
--backend lmdeploy \
|
|
||||||
--tokenizer /tokenizer_cache \
|
|
||||||
--dataset-name $dataset_name \
|
|
||||||
--dataset-path $dataset_path \
|
|
||||||
--num-prompts $num_prompts \
|
|
||||||
--port $port \
|
|
||||||
--save-result \
|
|
||||||
--result-dir $RESULTS_FOLDER \
|
|
||||||
--result-filename ${new_test_name}.json \
|
|
||||||
--request-rate $qps \
|
|
||||||
--model \"$model_name\" \
|
|
||||||
$client_args"
|
|
||||||
|
|
||||||
echo "Running test case $test_name with qps $qps"
|
|
||||||
echo "Client command: $client_command"
|
|
||||||
|
|
||||||
eval "$client_command"
|
|
||||||
|
|
||||||
# record the benchmarking commands
|
|
||||||
jq_output=$(jq -n \
|
|
||||||
--arg server "$server_command" \
|
|
||||||
--arg client "$client_command" \
|
|
||||||
--arg gpu "$gpu_type" \
|
|
||||||
--arg engine "lmdeploy" \
|
|
||||||
'{
|
|
||||||
server_command: $server,
|
|
||||||
client_command: $client,
|
|
||||||
gpu_type: $gpu,
|
|
||||||
engine: $engine
|
|
||||||
}')
|
|
||||||
echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
|
|
||||||
|
|
||||||
done
|
|
||||||
|
|
||||||
# clean up
|
|
||||||
kill_gpu_processes
|
|
||||||
rm -rf /root/.cache/huggingface/*
|
|
||||||
done
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
upload_to_buildkite() {
|
|
||||||
# upload the benchmarking results to buildkite
|
|
||||||
|
|
||||||
# if the agent binary is not found, skip uploading the results, exit 0
|
|
||||||
if [ ! -f /workspace/buildkite-agent ]; then
|
|
||||||
echo "buildkite-agent binary not found. Skip uploading the results."
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
# /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
|
|
||||||
/workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
main() {
|
|
||||||
|
|
||||||
check_gpus
|
|
||||||
# enter vllm directory
|
|
||||||
cd $VLLM_SOURCE_CODE_LOC/benchmarks
|
|
||||||
|
|
||||||
declare -g RESULTS_FOLDER=results/
|
|
||||||
mkdir -p $RESULTS_FOLDER
|
|
||||||
BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
|
|
||||||
|
|
||||||
python -m pip install transformers==4.41.2
|
|
||||||
|
|
||||||
export CURRENT_LLM_SERVING_ENGINE=lmdeploy
|
|
||||||
run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
|
|
||||||
python -m pip install tabulate pandas
|
|
||||||
python $BENCHMARK_ROOT/scripts/summary-nightly-results.py
|
|
||||||
upload_to_buildkite
|
|
||||||
|
|
||||||
}
|
|
||||||
|
|
||||||
main "$@"
|
|
357
.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
Normal file
357
.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
Normal file
@ -0,0 +1,357 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -o pipefail
|
||||||
|
set -x
|
||||||
|
|
||||||
|
check_gpus() {
|
||||||
|
# check the number of GPUs and GPU type.
|
||||||
|
declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
|
||||||
|
if [[ $gpu_count -gt 0 ]]; then
|
||||||
|
echo "GPU found."
|
||||||
|
else
|
||||||
|
echo "Need at least 1 GPU to run benchmarking."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
|
||||||
|
echo "GPU type is $gpu_type"
|
||||||
|
}
|
||||||
|
|
||||||
|
check_hf_token() {
|
||||||
|
# check if HF_TOKEN is available and valid
|
||||||
|
if [[ -z "$HF_TOKEN" ]]; then
|
||||||
|
echo "Error: HF_TOKEN is not set."
|
||||||
|
exit 1
|
||||||
|
elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then
|
||||||
|
echo "Error: HF_TOKEN does not start with 'hf_'."
|
||||||
|
exit 1
|
||||||
|
else
|
||||||
|
echo "HF_TOKEN is set and valid."
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
upload_to_buildkite() {
|
||||||
|
# upload the benchmarking results to buildkite
|
||||||
|
|
||||||
|
# if the agent binary is not found, skip uploading the results, exit 0
|
||||||
|
if [ ! -f /workspace/buildkite-agent ]; then
|
||||||
|
echo "buildkite-agent binary not found. Skip uploading the results."
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
# /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
|
||||||
|
/workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
get_current_llm_serving_engine() {
|
||||||
|
|
||||||
|
if which lmdeploy >/dev/null; then
|
||||||
|
echo "Container: lmdeploy"
|
||||||
|
export CURRENT_LLM_SERVING_ENGINE=lmdeploy
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -e /tgi-entrypoint.sh ]; then
|
||||||
|
echo "Container: tgi"
|
||||||
|
export CURRENT_LLM_SERVING_ENGINE=tgi
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
if which trtllm-build >/dev/null; then
|
||||||
|
echo "Container: tensorrt-llm"
|
||||||
|
export CURRENT_LLM_SERVING_ENGINE=trt
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -e /sgl-workspace ]; then
|
||||||
|
echo "Container: sglang"
|
||||||
|
export CURRENT_LLM_SERVING_ENGINE=sglang
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -e /vllm-workspace ]; then
|
||||||
|
echo "Container: vllm"
|
||||||
|
# move to a completely irrelevant directory, to avoid import vllm from current folder
|
||||||
|
export CURRENT_LLM_SERVING_ENGINE=vllm
|
||||||
|
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
json2args() {
|
||||||
|
# transforms the JSON string to command line args, and '_' is replaced to '-'
|
||||||
|
# example:
|
||||||
|
# input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
|
||||||
|
# output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
|
||||||
|
local json_string=$1
|
||||||
|
local args=$(
|
||||||
|
echo "$json_string" | jq -r '
|
||||||
|
to_entries |
|
||||||
|
map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
|
||||||
|
join(" ")
|
||||||
|
'
|
||||||
|
)
|
||||||
|
echo "$args"
|
||||||
|
}
|
||||||
|
|
||||||
|
kill_gpu_processes() {
|
||||||
|
pkill -f python
|
||||||
|
pkill -f python3
|
||||||
|
pkill -f tritonserver
|
||||||
|
pkill -f pt_main_thread
|
||||||
|
pkill -f text-generation
|
||||||
|
pkill -f lmdeploy
|
||||||
|
|
||||||
|
while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
|
||||||
|
sleep 1
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
wait_for_server() {
|
||||||
|
# wait for vllm server to start
|
||||||
|
# return 1 if vllm server crashes
|
||||||
|
timeout 1200 bash -c '
|
||||||
|
until curl -s localhost:8000/v1/completions > /dev/null; do
|
||||||
|
sleep 1
|
||||||
|
done' && return 0 || return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
ensure_installed() {
|
||||||
|
# Ensure that the given command is installed by apt-get
|
||||||
|
local cmd=$1
|
||||||
|
if ! which $cmd >/dev/null; then
|
||||||
|
apt-get update && apt-get install -y $cmd
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
run_serving_tests() {
|
||||||
|
  # run serving tests using `benchmark_serving.py`
  # $1: a json file specifying serving test cases

  local serving_test_file
  serving_test_file=$1

  # Iterate over serving tests
  jq -c '.[]' "$serving_test_file" | while read -r params; do

    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')

    # if TEST_SELECTOR is set, only run the test cases that match the selector
    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
      echo "Skip test case $test_name."
      continue
    fi

    # prepend the current serving engine to the test name
    test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}

    # get common parameters
    common_params=$(echo "$params" | jq -r '.common_parameters')
    model=$(echo "$common_params" | jq -r '.model')
    tp=$(echo "$common_params" | jq -r '.tp')
    dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
    dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
    port=$(echo "$common_params" | jq -r '.port')
    num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
    reuse_server=$(echo "$common_params" | jq -r '.reuse_server')

    # get client and server arguments
    server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
    client_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_client_parameters")
    client_args=$(json2args "$client_params")
    qps_list=$(echo "$params" | jq -r '.qps_list')
    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
    echo "Running over qps list $qps_list"

    # check if there is enough GPU to run the test
    if [[ $gpu_count -lt $tp ]]; then
      echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
      continue
    fi

    if [[ $reuse_server == "true" ]]; then
      echo "Reuse previous server for test case $test_name"
    else
      kill_gpu_processes
      bash $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh \
        "$server_params" "$common_params"
    fi

    wait_for_server

    if [ $? -eq 0 ]; then
      echo ""
      echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
    else
      echo ""
      echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
      break
    fi

    # prepare tokenizer
    # this is required for lmdeploy.
    cd $VLLM_SOURCE_CODE_LOC/benchmarks
    rm -rf /tokenizer_cache
    mkdir /tokenizer_cache
    python3 ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
      --model "$model" \
      --cachedir /tokenizer_cache
    cd $VLLM_SOURCE_CODE_LOC/benchmarks

    # change model name for lmdeploy (it will not follow standard hf name)
    if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
      model=$(python ../.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py)
    fi

    # iterate over different QPS
    for qps in $qps_list; do

      # remove the surrounding single quote from qps
      if [[ "$qps" == *"inf"* ]]; then
        echo "qps was $qps"
        qps="inf"
        echo "now qps is $qps"
      fi

      new_test_name=$test_name"_qps_"$qps

      backend=$CURRENT_LLM_SERVING_ENGINE

      if [[ $backend = "trt" ]]; then
        backend="tensorrt-llm"
      fi

      if [[ "$backend" == *"vllm"* ]]; then
        backend="vllm"
      fi

      if [[ "$dataset_name" = "sharegpt" ]]; then

        client_command="python3 benchmark_serving.py \
          --backend $backend \
          --tokenizer /tokenizer_cache \
          --model $model \
          --dataset-name $dataset_name \
          --dataset-path $dataset_path \
          --num-prompts $num_prompts \
          --port $port \
          --save-result \
          --result-dir $RESULTS_FOLDER \
          --result-filename ${new_test_name}.json \
          --request-rate $qps \
          --ignore-eos \
          $client_args"

      elif [[ "$dataset_name" = "sonnet" ]]; then

        sonnet_input_len=$(echo "$common_params" | jq -r '.sonnet_input_len')
        sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
        sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')

        client_command="python3 benchmark_serving.py \
          --backend $backend \
          --tokenizer /tokenizer_cache \
          --model $model \
          --dataset-name $dataset_name \
          --dataset-path $dataset_path \
          --num-prompts $num_prompts \
          --sonnet-input-len $sonnet_input_len \
          --sonnet-output-len $sonnet_output_len \
          --sonnet-prefix-len $sonnet_prefix_len \
          --port $port \
          --save-result \
          --result-dir $RESULTS_FOLDER \
          --result-filename ${new_test_name}.json \
          --request-rate $qps \
          --ignore-eos \
          $client_args"

      else

        echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
        exit 1

      fi

      echo "Running test case $test_name with qps $qps"
      echo "Client command: $client_command"

      eval "$client_command"

      server_command="None"

      # record the benchmarking commands
      jq_output=$(jq -n \
        --arg server "$server_command" \
        --arg client "$client_command" \
        --arg gpu "$gpu_type" \
        --arg engine "$CURRENT_LLM_SERVING_ENGINE" \
        '{
          server_command: $server,
          client_command: $client,
          gpu_type: $gpu,
          engine: $engine
        }')
      echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"

    done

  done

  kill_gpu_processes
}

prepare_dataset() {

  # download sharegpt dataset
  cd $VLLM_SOURCE_CODE_LOC/benchmarks
  wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

  # duplicate sonnet by 4x, to allow benchmarking with input length 2048
  cd $VLLM_SOURCE_CODE_LOC/benchmarks
  echo "" > sonnet_4x.txt
  for _ in {1..4}
  do
    cat sonnet.txt >> sonnet_4x.txt
  done

}

main() {

  # check if the environment variable is successfully injected from yaml

  check_gpus
  check_hf_token
  get_current_llm_serving_engine

  pip install -U transformers

  # check storage
  df -h

  ensure_installed wget
  ensure_installed curl
  ensure_installed jq

  prepare_dataset

  cd $VLLM_SOURCE_CODE_LOC/benchmarks
  declare -g RESULTS_FOLDER=results/
  mkdir -p $RESULTS_FOLDER
  BENCHMARK_ROOT=$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/

  # run the test
  run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json

  # upload benchmark results to buildkite
  python3 -m pip install tabulate pandas
  python3 $BENCHMARK_ROOT/scripts/summary-nightly-results.py
  upload_to_buildkite

}

main "$@"
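For reference, a hand-expanded sketch of one client command the loop above generates (vLLM engine, ShareGPT dataset, QPS 4). It assumes a server is already listening on port 8000 and that prepare_dataset has downloaded the ShareGPT file; the real loop also passes `--tokenizer /tokenizer_cache`, which is omitted here for brevity:

```bash
# Illustrative only: one expanded benchmark_serving.py invocation.
# The result filename follows the ${new_test_name}.json convention used above.
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
python3 benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500 \
  --port 8000 \
  --save-result \
  --result-dir results/ \
  --result-filename vllm_llama8B_tp1_sharegpt_qps_4.json \
  --request-rate 4 \
  --ignore-eos
```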
@@ -1,216 +0,0 @@
#!/bin/bash

set -o pipefail

check_gpus() {
  # check the number of GPUs and GPU type.
  declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
  if [[ $gpu_count -gt 0 ]]; then
    echo "GPU found."
  else
    echo "Need at least 1 GPU to run benchmarking."
    exit 1
  fi
  declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
  echo "GPU type is $gpu_type"
}

kill_gpu_processes() {
  pkill text-generation || true
  # waiting for GPU processes to be fully killed
  sleep 10
  # Print the GPU memory usage
  # so that we know if all GPU processes are killed.
  gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
  # The memory usage should be 0 MB.
  echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}

json2args() {
  # transforms the JSON string to command line args, and '_' is replaced to '-'
  # example:
  # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
  # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
  local json_string=$1
  local args=$(
    echo "$json_string" | jq -r '
      to_entries |
      map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
      join(" ")
    '
  )
  echo "$args"
}

wait_for_server() {
  timeout 1200 bash -c '
    until curl -s localhost:8000/generate_stream > /dev/null; do
      sleep 1
    done' && return 0 || return 1
}

run_serving_tests() {
  # run serving tests using `benchmark_serving.py`
  # $1: a json file specifying serving test cases

  local serving_test_file
  serving_test_file=$1

  # Iterate over serving tests
  jq -c '.[]' "$serving_test_file" | while read -r params; do
    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')

    # if TEST_SELECTOR is set, only run the test cases that match the selector
    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
      echo "Skip test case $test_name."
      continue
    fi

    # append tgi to the test name
    test_name=tgi_$test_name

    # get common parameters
    common_params=$(echo "$params" | jq -r '.common_parameters')
    model=$(echo "$common_params" | jq -r '.model')
    tp=$(echo "$common_params" | jq -r '.tp')
    dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
    dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
    port=$(echo "$common_params" | jq -r '.port')
    num_prompts=$(echo "$common_params" | jq -r '.num_prompts')

    # get client and server arguments
    server_params=$(echo "$params" | jq -r '.tgi_server_parameters')
    client_params=$(echo "$params" | jq -r '.tgi_client_parameters')
    server_args=$(json2args "$server_params")
    client_args=$(json2args "$client_params")
    qps_list=$(echo "$params" | jq -r '.qps_list')
    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
    echo "Running over qps list $qps_list"

    # check if there is enough GPU to run the test
    if [[ $gpu_count -lt $tp ]]; then
      echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
      continue
    fi

    if echo "$common_params" | jq -e 'has("fp8")' > /dev/null; then
      echo "Key 'fp8' exists in common params."
      server_command="/tgi-entrypoint.sh \
        --model-id $model \
        --num-shard $tp \
        --port $port \
        --quantize fp8 \
        $server_args"
    else
      echo "Key 'fp8' does not exist in common params."
      server_command="/tgi-entrypoint.sh \
        --model-id $model \
        --num-shard $tp \
        --port $port \
        $server_args"
    fi

    # run the server
    echo "Running test case $test_name"
    echo "Server command: $server_command"
    eval "$server_command" &

    # wait until the server is alive
    wait_for_server
    if [ $? -eq 0 ]; then
      echo ""
      echo "tgi server is up and running."
    else
      echo ""
      echo "tgi failed to start within the timeout period."
      break
    fi

    # iterate over different QPS
    for qps in $qps_list; do
      # remove the surrounding single quote from qps
      if [[ "$qps" == *"inf"* ]]; then
        echo "qps was $qps"
        qps="inf"
        echo "now qps is $qps"
      fi

      new_test_name=$test_name"_qps_"$qps

      client_command="python3 benchmark_serving.py \
        --backend tgi \
        --model $model \
        --dataset-name $dataset_name \
        --dataset-path $dataset_path \
        --num-prompts $num_prompts \
        --port $port \
        --save-result \
        --result-dir $RESULTS_FOLDER \
        --result-filename ${new_test_name}.json \
        --request-rate $qps \
        $client_args"

      echo "Running test case $test_name with qps $qps"
      echo "Client command: $client_command"

      eval "$client_command"

      # record the benchmarking commands
      jq_output=$(jq -n \
        --arg server "$server_command" \
        --arg client "$client_command" \
        --arg gpu "$gpu_type" \
        --arg engine "tgi" \
        '{
          server_command: $server,
          client_command: $client,
          gpu_type: $gpu,
          engine: $engine
        }')
      echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"

    done

    # clean up
    kill_gpu_processes
    rm -rf /root/.cache/huggingface/*
  done
}

upload_to_buildkite() {
  # upload the benchmarking results to buildkite

  # if the agent binary is not found, skip uploading the results, exit 0
  if [ ! -f /workspace/buildkite-agent ]; then
    echo "buildkite-agent binary not found. Skip uploading the results."
    return 0
  fi
  # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
  /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
}

main() {

  check_gpus
  # enter vllm directory
  cd $VLLM_SOURCE_CODE_LOC/benchmarks
  declare -g RESULTS_FOLDER=results/
  mkdir -p $RESULTS_FOLDER
  BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/

  export CURRENT_LLM_SERVING_ENGINE=tgi
  run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
  python -m pip install tabulate pandas
  python $BENCHMARK_ROOT/scripts/summary-nightly-results.py
  upload_to_buildkite

}

main "$@"
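The json2args helper above (defined identically in each per-engine script and in the consolidated runner) is easy to try standalone; a minimal sketch assuming only `jq` is installed, with made-up sample values:

```bash
# Standalone copy of the json2args helper for experimentation.
json2args() {
  echo "$1" | jq -r '
    to_entries |
    map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
    join(" ")
  '
}

# The example from the script's own comment:
json2args '{ "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }'
# prints: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1

# Empty-string values become bare switches, which is how flags such as
# "disable_log_stats": "" are forwarded to the server command line.
json2args '{ "disable_log_stats": "", "gpu_memory_utilization": 0.9 }'
# prints: --disable-log-stats  --gpu-memory-utilization 0.9
```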
@@ -1,214 +0,0 @@
#!/bin/bash

set -o pipefail

check_gpus() {
  # check the number of GPUs and GPU type.
  declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
  if [[ $gpu_count -gt 0 ]]; then
    echo "GPU found."
  else
    echo "Need at least 1 GPU to run benchmarking."
    exit 1
  fi
  declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
  echo "GPU type is $gpu_type"
}

kill_gpu_processes() {
  pkill tritonserver || true
  # waiting for GPU processes to be fully killed
  sleep 20
  # Print the GPU memory usage
  # so that we know if all GPU processes are killed.
  gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
  # The memory usage should be 0 MB.
  echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}

json2args() {
  # transforms the JSON string to command line args, and '_' is replaced to '-'
  # example:
  # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
  # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
  local json_string=$1
  local args=$(
    echo "$json_string" | jq -r '
      to_entries |
      map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
      join(" ")
    '
  )
  echo "$args"
}

wait_for_server() {
  timeout 1200 bash -c '
    until curl -s localhost:8000/generate_stream > /dev/null; do
      sleep 1
    done' && return 0 || return 1
}

run_serving_tests() {
  # run serving tests using `benchmark_serving.py`
  # $1: a json file specifying serving test cases

  local serving_test_file
  serving_test_file=$1

  # Iterate over serving tests
  jq -c '.[]' "$serving_test_file" | while read -r params; do
    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')

    # if TEST_SELECTOR is set, only run the test cases that match the selector
    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
      echo "Skip test case $test_name."
      continue
    fi

    # append trt to the test name
    test_name=trt_$test_name

    # get common parameters
    common_params=$(echo "$params" | jq -r '.common_parameters')
    model=$(echo "$common_params" | jq -r '.model')
    tp=$(echo "$common_params" | jq -r '.tp')
    dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
    dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
    port=$(echo "$common_params" | jq -r '.port')
    num_prompts=$(echo "$common_params" | jq -r '.num_prompts')

    # get client and server arguments
    server_params=$(echo "$params" | jq -r '.trt_server_parameters')
    client_params=$(echo "$params" | jq -r '.trt_client_parameters')
    client_args=$(json2args "$client_params")
    qps_list=$(echo "$params" | jq -r '.qps_list')
    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
    echo "Running over qps list $qps_list"

    # check if there is enough GPU to run the test
    if [[ $gpu_count -lt $tp ]]; then
      echo "Required model_tp_size $tp but only $gpu_count GPU found. Skip testcase $test_name."
      continue
    fi

    cd $VLLM_SOURCE_CODE_LOC/benchmarks

    echo "Running test case $test_name"
    bash ../.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh "$server_params" "$common_params"

    # wait until the server is alive
    wait_for_server
    if [ $? -eq 0 ]; then
      echo ""
      echo "trt server is up and running."
    else
      echo ""
      echo "trt failed to start within the timeout period."
      break
    fi

    # prepare tokenizer
    cd $VLLM_SOURCE_CODE_LOC/benchmarks
    rm -rf /tokenizer_cache
    mkdir /tokenizer_cache
    python ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
      --model "$model" \
      --cachedir /tokenizer_cache
    cd $VLLM_SOURCE_CODE_LOC/benchmarks

    # iterate over different QPS
    for qps in $qps_list; do
      # remove the surrounding single quote from qps
      if [[ "$qps" == *"inf"* ]]; then
        echo "qps was $qps"
        qps="inf"
        echo "now qps is $qps"
      fi

      new_test_name=$test_name"_qps_"$qps

      client_command="python3 benchmark_serving.py \
        --backend tensorrt-llm \
        --tokenizer /tokenizer_cache \
        --model $model \
        --dataset-name $dataset_name \
        --dataset-path $dataset_path \
        --num-prompts $num_prompts \
        --port $port \
        --save-result \
        --result-dir $RESULTS_FOLDER \
        --result-filename ${new_test_name}.json \
        --request-rate $qps \
        $client_args"

      echo "Running test case $test_name with qps $qps"
      echo "Client command: $client_command"

      eval "$client_command"

      server_command=""
      # record the benchmarking commands
      jq_output=$(jq -n \
        --arg server "$server_command" \
        --arg client "$client_command" \
        --arg gpu "$gpu_type" \
        --arg engine "trt" \
        '{
          server_command: $server,
          client_command: $client,
          gpu_type: $gpu,
          engine: $engine
        }')
      echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"

    done

    # clean up
    kill_gpu_processes
    rm -rf /root/.cache/huggingface/*
  done
}

upload_to_buildkite() {
  # upload the benchmarking results to buildkite

  # if the agent binary is not found, skip uploading the results, exit 0
  if [ ! -f /workspace/buildkite-agent ]; then
    echo "buildkite-agent binary not found. Skip uploading the results."
    return 0
  fi
  # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
  /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
}

main() {

  check_gpus

  # enter vllm directory
  cd $VLLM_SOURCE_CODE_LOC/benchmarks

  declare -g RESULTS_FOLDER=results/
  mkdir -p $RESULTS_FOLDER
  BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/

  # update transformers package, to make sure mixtral tokenizer is available
  python -m pip install transformers -U

  export CURRENT_LLM_SERVING_ENGINE=trt
  run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
  python -m pip install tabulate pandas
  python $BENCHMARK_ROOT/scripts/summary-nightly-results.py
  upload_to_buildkite

}

main "$@"
@@ -1,221 +0,0 @@
#!/bin/bash

set -o pipefail

check_gpus() {
  # check the number of GPUs and GPU type.
  declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
  if [[ $gpu_count -gt 0 ]]; then
    echo "GPU found."
  else
    echo "Need at least 1 GPU to run benchmarking."
    exit 1
  fi
  declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
  echo "GPU type is $gpu_type"
}

kill_gpu_processes() {
  # kill all processes on GPU.
  pkill pt_main_thread
  sleep 10

  # remove vllm config file
  rm -rf ~/.config/vllm

  # Print the GPU memory usage
  # so that we know if all GPU processes are killed.
  gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
  # The memory usage should be 0 MB.
  echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}

json2args() {
  # transforms the JSON string to command line args, and '_' is replaced to '-'
  # example:
  # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
  # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
  local json_string=$1
  local args=$(
    echo "$json_string" | jq -r '
      to_entries |
      map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
      join(" ")
    '
  )
  echo "$args"
}

wait_for_server() {
  # wait for vllm server to start
  # return 1 if vllm server crashes
  timeout 1200 bash -c '
    until curl -s localhost:8000/v1/completions > /dev/null; do
      sleep 1
    done' && return 0 || return 1
}

run_serving_tests() {
  # run serving tests using `benchmark_serving.py`
  # $1: a json file specifying serving test cases

  local serving_test_file
  serving_test_file=$1

  # Iterate over serving tests
  jq -c '.[]' "$serving_test_file" | while read -r params; do
    # get the test name, and append the GPU type back to it.
    test_name=$(echo "$params" | jq -r '.test_name')

    # if TEST_SELECTOR is set, only run the test cases that match the selector
    if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
      echo "Skip test case $test_name."
      continue
    fi

    # append vllm to the test name
    test_name=vllm_$test_name

    # get common parameters
    common_params=$(echo "$params" | jq -r '.common_parameters')
    model=$(echo "$common_params" | jq -r '.model')
    tp=$(echo "$common_params" | jq -r '.tp')
    dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
    dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
    port=$(echo "$common_params" | jq -r '.port')
    num_prompts=$(echo "$common_params" | jq -r '.num_prompts')

    # get client and server arguments
    server_params=$(echo "$params" | jq -r '.vllm_server_parameters')
    client_params=$(echo "$params" | jq -r '.vllm_client_parameters')
    server_args=$(json2args "$server_params")
    client_args=$(json2args "$client_params")
    qps_list=$(echo "$params" | jq -r '.qps_list')
    qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
    echo "Running over qps list $qps_list"

    # check if there is enough GPU to run the test
    if [[ $gpu_count -lt $tp ]]; then
      echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
      continue
    fi

    if echo "$common_params" | jq -e 'has("fp8")' > /dev/null; then
      echo "Key 'fp8' exists in common params. Use neuralmagic fp8 model for convenience."
      model=$(echo "$common_params" | jq -r '.neuralmagic_quantized_model')
      server_command="python3 \
        -m vllm.entrypoints.openai.api_server \
        -tp $tp \
        --model $model \
        --port $port \
        $server_args"
    else
      echo "Key 'fp8' does not exist in common params."
      server_command="python3 \
        -m vllm.entrypoints.openai.api_server \
        -tp $tp \
        --model $model \
        --port $port \
        $server_args"
    fi

    # run the server
    echo "Running test case $test_name"
    echo "Server command: $server_command"
    eval "$server_command" &

    # wait until the server is alive
    wait_for_server
    if [ $? -eq 0 ]; then
      echo ""
      echo "vllm server is up and running."
    else
      echo ""
      echo "vllm failed to start within the timeout period."
      break
    fi

    # iterate over different QPS
    for qps in $qps_list; do
      # remove the surrounding single quote from qps
      if [[ "$qps" == *"inf"* ]]; then
        echo "qps was $qps"
        qps="inf"
        echo "now qps is $qps"
      fi

      new_test_name=$test_name"_qps_"$qps

      client_command="python3 benchmark_serving.py \
        --backend vllm \
        --model $model \
        --dataset-name $dataset_name \
        --dataset-path $dataset_path \
        --num-prompts $num_prompts \
        --port $port \
        --save-result \
        --result-dir $RESULTS_FOLDER \
        --result-filename ${new_test_name}.json \
        --request-rate $qps \
        $client_args"

      echo "Running test case $test_name with qps $qps"
      echo "Client command: $client_command"

      eval "$client_command"

      # record the benchmarking commands
      jq_output=$(jq -n \
        --arg server "$server_command" \
        --arg client "$client_command" \
        --arg gpu "$gpu_type" \
        --arg engine "vllm" \
        '{
          server_command: $server,
          client_command: $client,
          gpu_type: $gpu,
          engine: $engine
        }')
      echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"

    done

    # clean up
    kill_gpu_processes
    rm -rf /root/.cache/huggingface/*
  done
}

upload_to_buildkite() {
  # upload the benchmarking results to buildkite

  # if the agent binary is not found, skip uploading the results, exit 0
  if [ ! -f /workspace/buildkite-agent ]; then
    echo "buildkite-agent binary not found. Skip uploading the results."
    return 0
  fi
  # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
  /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
}

main() {

  check_gpus
  # enter vllm directory
  cd $VLLM_SOURCE_CODE_LOC/benchmarks
  declare -g RESULTS_FOLDER=results/
  mkdir -p $RESULTS_FOLDER
  BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/

  export CURRENT_LLM_SERVING_ENGINE=vllm
  run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json

  python3 -m pip install tabulate pandas
  python3 $BENCHMARK_ROOT/scripts/summary-nightly-results.py
  upload_to_buildkite

}

main "$@"
@@ -17,10 +17,17 @@ serving_column_mapping = {
     "request_throughput": "Tput (req/s)",
     "mean_ttft_ms": "Mean TTFT (ms)",
     "std_ttft_ms": "Std TTFT (ms)",
+    "median_ttft_ms": "Median TTFT (ms)",
     "mean_itl_ms": "Mean ITL (ms)",
     "std_itl_ms": "Std ITL (ms)",
-    "input_throughput": "Input Tput (tok/s)",
+    "median_itl_ms": "Median ITL (ms)",
+    "mean_tpot_ms": "Mean TPOT (ms)",
+    "std_tpot_ms": "Std TPOT (ms)",
+    "median_tpot_ms": "Median TPOT (ms)",
+    "total_token_throughput": "Total Token Tput (tok/s)",
     "output_throughput": "Output Tput (tok/s)",
+    "total_input_tokens": "Total input tokens",
+    "total_output_tokens": "Total output tokens",
     "engine": "Engine",
 }

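To eyeball the newly tabulated columns before running the summary script, the same metric keys can be read straight out of a benchmark_serving.py result file; a small sketch, where the result filename is an assumption following the ${new_test_name}.json convention:

```bash
# Illustrative only: peek at the metrics the updated summary script now maps
# to table columns, directly from one saved result file.
jq '{median_ttft_ms, median_itl_ms, median_tpot_ms,
     total_token_throughput, total_input_tokens, total_output_tokens}' \
  results/vllm_llama8B_tp1_sharegpt_qps_4.json
```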
@@ -1,16 +1,18 @@
 [
     {
-        "test_name": "llama8B_tp1",
-        "qps_list": [4],
+        "test_name": "llama8B_tp1_sharegpt",
+        "qps_list": [4,8,16,32,"inf"],
         "common_parameters": {
-            "model": "meta-llama/Meta-Llama-3-8B",
+            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
             "tp": 1,
             "dataset_name": "sharegpt",
             "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
             "num_prompts": 500,
-            "port": 8000
+            "port": 8000,
+            "reuse_server": false
         },
         "lmdeploy_server_parameters": {
+            "dtype": "bfloat16"
         },
         "lmdeploy_client_parameters": {
         },
@@ -21,34 +23,158 @@
         },
         "trt_server_parameters": {
             "model_type": "llama",
-            "model_dtype": "float16",
-            "max_batch_size": 256,
+            "model_dtype": "bfloat16",
+            "max_batch_size": 2048,
             "max_input_len": 4096,
-            "max_output_len": 4096,
-            "trt_llm_version": "r24.04"
+            "max_seq_len": 6144,
+            "max_num_tokens": 16384,
+            "trt_llm_version": "v0.11.0"
         },
         "trt_client_parameters": {
             "endpoint": "/v2/models/ensemble/generate_stream"
         },
         "vllm_server_parameters": {
             "disable_log_stats": "",
-            "disable_log_requests": ""
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
         },
         "vllm_client_parameters": {
+        },
+        "sglang_server_parameters": {
+            "disable_radix_cache": "",
+            "enable_torch_compile": "",
+            "dtype": "bfloat16"
+        },
+        "sglang_client_parameters": {
         }
     },
     {
-        "test_name": "llama70B_tp4",
-        "qps_list": [2],
+        "test_name": "llama8B_tp1_sonnet_512_16",
+        "qps_list": [4,8,16,32,"inf"],
+        "common_parameters": {
+            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+            "tp": 1,
+            "dataset_name": "sonnet",
+            "dataset_path": "./sonnet_4x.txt",
+            "num_prompts": 500,
+            "port": 8000,
+            "sonnet_input_len": 512,
+            "sonnet_output_len": 16,
+            "sonnet_prefix_len": 50,
+            "reuse_server": true
+        },
+        "lmdeploy_server_parameters": {
+            "dtype": "bfloat16"
+        },
+        "lmdeploy_client_parameters": {
+        },
+        "tgi_server_parameters": {
+        },
+        "tgi_client_parameters": {
+            "endpoint": "/generate_stream"
+        },
+        "trt_server_parameters": {
+            "model_type": "llama",
+            "model_dtype": "bfloat16",
+            "max_batch_size": 2048,
+            "max_input_len": 4096,
+            "max_seq_len": 6144,
+            "max_num_tokens": 16384,
+            "trt_llm_version": "v0.11.0"
+        },
+        "trt_client_parameters": {
+            "endpoint": "/v2/models/ensemble/generate_stream"
+        },
+        "vllm_server_parameters": {
+            "disable_log_stats": "",
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
+        },
+        "vllm_client_parameters": {
+        },
+        "sglang_server_parameters": {
+            "disable_radix_cache": "",
+            "enable_torch_compile": "",
+            "dtype": "bfloat16"
+        },
+        "sglang_client_parameters": {
+        }
+    },
+    {
+        "test_name": "llama8B_tp1_sonnet_512_256",
+        "qps_list": [4,8,16,32,"inf"],
+        "common_parameters": {
+            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+            "tp": 1,
+            "dataset_name": "sonnet",
+            "dataset_path": "./sonnet_4x.txt",
+            "num_prompts": 500,
+            "port": 8000,
+            "sonnet_input_len": 512,
+            "sonnet_output_len": 256,
+            "sonnet_prefix_len": 50,
+            "reuse_server": true
+        },
+        "lmdeploy_server_parameters": {
+            "dtype": "bfloat16"
+        },
+        "lmdeploy_client_parameters": {
+        },
+        "tgi_server_parameters": {
+        },
+        "tgi_client_parameters": {
+            "endpoint": "/generate_stream"
+        },
+        "trt_server_parameters": {
+            "model_type": "llama",
+            "model_dtype": "bfloat16",
+            "max_batch_size": 2048,
+            "max_input_len": 4096,
+            "max_seq_len": 6144,
+            "max_num_tokens": 16384,
+            "trt_llm_version": "v0.11.0"
+        },
+        "trt_client_parameters": {
+            "endpoint": "/v2/models/ensemble/generate_stream"
+        },
+        "vllm_server_parameters": {
+            "disable_log_stats": "",
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
+        },
+        "vllm_client_parameters": {
+        },
+        "sglang_server_parameters": {
+            "disable_radix_cache": "",
+            "enable_torch_compile": "",
+            "dtype": "bfloat16"
+        },
+        "sglang_client_parameters": {
+        }
+    },
+    {
+        "test_name": "llama70B_tp4_sharegpt",
+        "qps_list": [4,8,16,32,"inf"],
         "common_parameters": {
             "model": "meta-llama/Meta-Llama-3-70B-Instruct",
             "tp": 4,
             "dataset_name": "sharegpt",
             "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
             "num_prompts": 500,
-            "port": 8000
+            "port": 8000,
+            "reuse_server": false
         },
         "lmdeploy_server_parameters": {
+            "dtype": "bfloat16"
         },
         "lmdeploy_client_parameters": {
         },
@@ -59,34 +185,50 @@
         },
         "trt_server_parameters": {
             "model_type": "llama",
-            "model_dtype": "float16",
-            "max_batch_size": 256,
+            "model_dtype": "bfloat16",
+            "max_batch_size": 2048,
             "max_input_len": 4096,
-            "max_output_len": 4096,
-            "trt_llm_version": "r24.04"
+            "max_seq_len": 6144,
+            "max_num_tokens": 16384,
+            "trt_llm_version": "v0.11.0"
         },
         "trt_client_parameters": {
            "endpoint": "/v2/models/ensemble/generate_stream"
         },
         "vllm_server_parameters": {
             "disable_log_stats": "",
-            "disable_log_requests": ""
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
         },
         "vllm_client_parameters": {
+        },
+        "sglang_server_parameters": {
+            "disable_radix_cache": "",
+            "dtype": "bfloat16"
+        },
+        "sglang_client_parameters": {
         }
     },
     {
-        "test_name": "mixtral8x7B_tp2",
-        "qps_list": [2],
+        "test_name": "llama70B_tp4_sonnet_512_16",
+        "qps_list": [4,8,16,32,"inf"],
         "common_parameters": {
-            "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
-            "tp": 2,
-            "dataset_name": "sharegpt",
-            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "tp": 4,
+            "dataset_name": "sonnet",
+            "dataset_path": "./sonnet_4x.txt",
             "num_prompts": 500,
-            "port": 8000
+            "port": 8000,
+            "sonnet_input_len": 512,
+            "sonnet_output_len": 16,
+            "sonnet_prefix_len": 50,
+            "reuse_server": true
         },
         "lmdeploy_server_parameters": {
+            "dtype": "bfloat16"
         },
         "lmdeploy_client_parameters": {
         },
@@ -97,20 +239,85 @@
         },
         "trt_server_parameters": {
             "model_type": "llama",
-            "model_dtype": "float16",
-            "max_batch_size": 256,
+            "model_dtype": "bfloat16",
+            "max_batch_size": 2048,
             "max_input_len": 4096,
-            "max_output_len": 4096,
-            "trt_llm_version": "r24.04"
+            "max_seq_len": 6144,
+            "max_num_tokens": 16384,
+            "trt_llm_version": "v0.11.0"
         },
         "trt_client_parameters": {
             "endpoint": "/v2/models/ensemble/generate_stream"
         },
         "vllm_server_parameters": {
             "disable_log_stats": "",
-            "disable_log_requests": ""
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
         },
         "vllm_client_parameters": {
+        },
+        "sglang_server_parameters": {
+            "disable_radix_cache": "",
+            "dtype": "bfloat16"
+        },
+        "sglang_client_parameters": {
+        }
+    },
+    {
+        "test_name": "llama70B_tp4_sonnet_512_256",
+        "qps_list": [4,8,16,32,"inf"],
+        "common_parameters": {
+            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+            "tp": 4,
+            "dataset_name": "sonnet",
+            "dataset_path": "./sonnet_4x.txt",
+            "num_prompts": 500,
+            "port": 8000,
+            "sonnet_input_len": 512,
+            "sonnet_output_len": 256,
+            "sonnet_prefix_len": 50,
+            "reuse_server": true
+        },
+        "lmdeploy_server_parameters": {
+            "dtype": "bfloat16"
+        },
+        "lmdeploy_client_parameters": {
+        },
+        "tgi_server_parameters": {
+        },
+        "tgi_client_parameters": {
+            "endpoint": "/generate_stream"
+        },
+        "trt_server_parameters": {
+            "model_type": "llama",
+            "model_dtype": "bfloat16",
+            "max_batch_size": 2048,
+            "max_input_len": 4096,
+            "max_seq_len": 6144,
+            "max_num_tokens": 16384,
+            "trt_llm_version": "v0.11.0"
+        },
+        "trt_client_parameters": {
+            "endpoint": "/v2/models/ensemble/generate_stream"
+        },
+        "vllm_server_parameters": {
+            "disable_log_stats": "",
+            "disable_log_requests": "",
+            "gpu_memory_utilization": 0.9,
+            "num_scheduler_steps": 10,
+            "max_num_seqs": 512,
+            "dtype": "bfloat16"
+        },
+        "vllm_client_parameters": {
+        },
+        "sglang_server_parameters": {
+            "disable_radix_cache": "",
+            "dtype": "bfloat16"
+        },
+        "sglang_client_parameters": {
         }
     }
 ]
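A locally edited test matrix can be sanity-checked with the same jq selectors that run_serving_tests uses; a rough sketch, assuming the file sits at the path referenced elsewhere in this commit:

```bash
TESTS=.buildkite/nightly-benchmarks/tests/nightly-tests.json

# List every test case and its QPS sweep.
jq -r '.[] | "\(.test_name): qps \(.qps_list)"' "$TESTS"

# Pull the engine-specific server parameters the way the runner script does.
CURRENT_LLM_SERVING_ENGINE=sglang
jq -c ".[0].${CURRENT_LLM_SERVING_ENGINE}_server_parameters" "$TESTS"
```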
@@ -26,6 +26,7 @@ class RequestFuncInput:
     use_beam_search: bool = False
     logprobs: Optional[int] = None
     multi_modal_content: Optional[dict] = None
+    ignore_eos: bool = False


 @dataclass
@@ -55,6 +56,7 @@ async def async_request_tgi(
         "do_sample": True,
         "temperature": 0.01,  # TGI does not accept 0.0 temperature.
         "top_p": 0.99,  # TGI does not accept 1.0 top_p.
+        # TGI does not accept ignore_eos flag.
     }
     payload = {
         "inputs": request_func_input.prompt,
@@ -129,6 +131,8 @@ async def async_request_trt_llm(
         "max_tokens": request_func_input.output_len,
         "stream": True,
     }
+    if request_func_input.ignore_eos:
+        payload["min_length"] = request_func_input.output_len
     output = RequestFuncOutput()
     output.prompt_len = request_func_input.prompt_len

@@ -240,6 +244,7 @@ async def async_request_openai_completions(
         "max_tokens": request_func_input.output_len,
         "logprobs": request_func_input.logprobs,
         "stream": True,
+        "ignore_eos": request_func_input.ignore_eos,
     }
     headers = {
         "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
@@ -327,6 +332,7 @@ async def async_request_openai_chat_completions(
         "temperature": 0.0,
         "max_tokens": request_func_input.output_len,
         "stream": True,
+        "ignore_eos": request_func_input.ignore_eos,
     }
     headers = {
         "Content-Type": "application/json",
@@ -430,4 +436,5 @@ ASYNC_REQUEST_FUNCS = {
     "openai-chat": async_request_openai_chat_completions,
     "tensorrt-llm": async_request_trt_llm,
     "scalellm": async_request_openai_completions,
+    "sglang": async_request_openai_completions,
 }
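The new ignore_eos field ends up as an extra key in the OpenAI-style completions payload. A hedged sketch of the equivalent raw request, assuming an OpenAI-compatible server (vLLM or SGLang) on localhost:8000 that honours ignore_eos as an extra sampling parameter; the model name is an assumption:

```bash
# Illustrative request only; endpoint, port, and model are assumptions.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 256,
        "ignore_eos": true,
        "stream": false
      }'
```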
@@ -397,6 +397,7 @@ async def benchmark(
     profile: bool,
     selected_percentile_metrics: List[str],
     selected_percentiles: List[str],
+    ignore_eos: bool,
 ):
     if backend in ASYNC_REQUEST_FUNCS:
         request_func = ASYNC_REQUEST_FUNCS[backend]
@@ -420,6 +421,7 @@ async def benchmark(
         best_of=best_of,
         use_beam_search=use_beam_search,
         multi_modal_content=test_mm_content,
+        ignore_eos=ignore_eos,
     )
     test_output = await request_func(request_func_input=test_input)
     if not test_output.success:
@@ -685,6 +687,7 @@ def main(args: argparse.Namespace):
             selected_percentiles=[
                 float(p) for p in args.metric_percentiles.split(",")
             ],
+            ignore_eos=args.ignore_eos,
         ))

     # Save config and results to json
@@ -863,6 +866,11 @@ if __name__ == "__main__":
         "{backend}-{args.request_rate}qps-{base_model_id}-{current_dt}.json"
         " format.",
     )
+    parser.add_argument(
+        "--ignore-eos",
+        action="store_true",
+        help="Set ignore_eos flag when sending the benchmark request."
+        "Warning: ignore_eos is not supported in deepspeed_mii and tgi.")
     parser.add_argument(
         "--percentile-metrics",
         type=str,
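With the new flag wired through, a typical standalone invocation looks roughly like the following; the running server, model, and dataset paths are assumptions for illustration:

```bash
# Illustrative only: run the serving benchmark against an already-running
# SGLang server, forcing full-length generations via --ignore-eos.
python3 benchmarks/benchmark_serving.py \
  --backend sglang \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sonnet \
  --dataset-path ./sonnet_4x.txt \
  --sonnet-input-len 512 --sonnet-output-len 256 --sonnet-prefix-len 50 \
  --num-prompts 500 \
  --request-rate 8 \
  --ignore-eos
```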