Author | Commit | Message | Date
Yile (Michael) Gu | 98a42e7078 | [Benchmark] Change mii to use persistent deployment and support tensor parallel (#3628) | 2024-03-28 17:33:52 -07:00
AmadeusChan | 1956931436 | [Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621) | 2024-03-27 13:39:05 -07:00
SangBin Cho | 01bfb22b41 | [CI] Try introducing isort. (#3495) | 2024-03-25 07:59:47 -07:00
Allen.Dou | 9cbc7e5f3b | enable --gpu-memory-utilization in benchmark_throughput.py (#3175) (Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>) | 2024-03-04 10:37:58 -08:00
Zhuohan Li | 996d095c54 | [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) | 2024-03-03 14:37:18 -08:00
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) (Co-authored-by: ElizaWszola <eliza@neuralmagic.com>, Michael Goin <michael@neuralmagic.com>) | 2024-03-02 00:50:01 -08:00
Kunshang Ji | 96b6f475dd | Remove hardcoded device="cuda" to support more devices (#2503) (Co-authored-by: Jiang Li <jiang1.li@intel.com>, Kunshang Ji <kunshang.ji@intel.com>) | 2024-02-01 15:46:39 -08:00
zhaoyang-star | 9090bf02e7 | Support FP8-E5M2 KV Cache (#2279) (Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>, Zhuohan Li <zhuohan123@gmail.com>) | 2024-01-28 16:43:54 -08:00
Woosuk Kwon | 37ca558103 | Optimize model execution with CUDA graph (#1926) (Co-authored-by: Chen Shen <scv119@gmail.com>, Antoni Baum <antoni.baum@protonmail.com>) | 2023-12-16 21:12:08 -08:00
CHU Tianxiang | 0fbfc4b81b | Add GPTQ support (#916) | 2023-12-15 03:04:22 -08:00
aisensiy | 8d8c2f6ffe | Support max-model-len argument for throughput benchmark (#1858) | 2023-11-30 08:10:24 -08:00
Simon Mo | 5ffc0d13a2 | Migrate linter from pylint to ruff (#1665) | 2023-11-20 11:58:01 -08:00
Zhuofan | dcc543a298 | [Minor] Fix comment (#1704) | 2023-11-17 09:42:49 -08:00
Woosuk Kwon | 660a7fcfa4 | Add DeepSpeed MII backend to benchmark script (#1649) | 2023-11-14 12:35:30 -08:00
chooper1 | 1f24755bf8 | Support SqueezeLLM (#1326) (Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>, Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2023-10-21 23:14:59 -07:00
Antoni Baum | acbed3ef40 | Use monotonic time where appropriate (#1249) | 2023-10-02 19:22:05 -07:00
kg6-sleipnir | b5a10eb0ef | Added dtype arg to benchmarks (#1228) | 2023-09-30 21:04:03 -07:00
Woosuk Kwon | e3e79e9e8a | Implement AWQ quantization support for LLaMA (#1032) (Co-authored-by: Robert Irvine <robert@seamlessml.com>, root <rirv938@gmail.com>, Casper <casperbh.96@gmail.com>, julian-q <julianhquevedo@gmail.com>) | 2023-09-16 00:03:37 -07:00
Ricardo Lu | 8c4b2592fb | fix: enable trust-remote-code in api server & benchmark. (#509) | 2023-07-19 17:06:15 -07:00
WRH | cf21a9bd5c | support trust_remote_code in benchmark (#518) | 2023-07-19 17:02:40 -07:00
Woosuk Kwon | 4338cc4750 | [Tokenizer] Add an option to specify tokenizer (#284) | 2023-06-28 09:46:58 -07:00
Woosuk Kwon | 0b98ba15c7 | Change the name to vLLM (#150) | 2023-06-17 03:07:40 -07:00
Zhuohan Li | e5464ee484 | Rename servers to engines (#152) | 2023-06-17 17:25:21 +08:00
Woosuk Kwon | bab8f3dd0d | [Minor] Fix benchmark_throughput.py (#151) | 2023-06-16 21:00:52 -07:00
Woosuk Kwon | 311490a720 | Add script for benchmarking serving throughput (#145) | 2023-06-14 19:55:38 -07:00
Woosuk Kwon | 8274ca23ac | Add docstrings for LLM (#137) | 2023-06-04 12:52:41 -07:00
Woosuk Kwon | 211318d44a | Add throughput benchmarking script (#133) | 2023-05-28 03:20:05 -07:00