20231088/vllm

Author	SHA1	Message	Date
SangBin Cho	b51c1cc9d2	[2/N] Chunked prefill data update (#3538)	2024-03-28 10:06:01 -07:00
Roger Wang	45b6ef6513	feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277)	2024-03-27 13:39:26 -07:00
AmadeusChan	1956931436	[Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621)	2024-03-27 13:39:05 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495)	2024-03-25 07:59:47 -07:00
Simon Mo	8e67598aa6	[Misc] fix line length for entire codebase (#3444)	2024-03-16 00:36:29 -07:00
Ronen Schaffer	14e3f9a1b2	Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (#2958)	2024-03-15 21:01:30 -07:00
youkaichao	8fe8386591	[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)	2024-03-14 08:11:48 +00:00
Terry	7e9bd08f60	Add batched RoPE kernel (#3095)	2024-03-13 13:45:26 -07:00
TianYu GUO	1ece1ae829	[Minor Fix] Fix comments in benchmark_serving (#3252)	2024-03-07 22:22:59 -08:00
Chen Wang	9a4548bae7	Fix the openai benchmarking requests to work with latest OpenAI apis (#2992) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-03-04 15:51:56 -08:00
Allen.Dou	9cbc7e5f3b	enable --gpu-memory-utilization in benchmark_throughput.py (#3175) Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>	2024-03-04 10:37:58 -08:00
TianYu GUO	901cf4c52b	[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171)	2024-03-03 22:48:27 -08:00
Philipp Moritz	17c3103c56	Make it easy to profile workers with nsight (#3162) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-03-03 16:19:13 -08:00
Zhuohan Li	996d095c54	[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158)	2024-03-03 14:37:18 -08:00
Sage Moore	ce4f5a29fb	Add Automatic Prefix Caching (#2762) Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-03-02 00:50:01 -08:00
Philipp Moritz	cfc15a1031	Optimize Triton MoE Kernel (#2979) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-02-26 13:48:56 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820)	2024-02-21 18:56:01 -08:00
Ronen Schaffer	d7f396486e	Update comment (#2934)	2024-02-21 18:18:37 -08:00
Roger Wang	a4211a4dc3	Serving Benchmark Refactoring (#2433)	2024-02-12 22:53:00 -08:00
Woosuk Kwon	72d3a30c63	[Minor] Fix benchmark_latency script (#2765)	2024-02-05 12:45:37 -08:00
Kunshang Ji	96b6f475dd	Remove hardcoded `device="cuda"` to support more devices (#2503) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>	2024-02-01 15:46:39 -08:00
zhaoyang-star	9090bf02e7	Support FP8-E5M2 KV Cache (#2279) Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-01-28 16:43:54 -08:00
Simon Mo	1e4277d2d1	lint: format all python file instead of just source code (#2567)	2024-01-23 15:53:06 -08:00
Antoni Baum	9b945daaf1	[Experimental] Add multi-LoRA support (#1804) Co-authored-by: Chen Shen <scv119@gmail.com> Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com> Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-01-23 15:26:37 -08:00
Harry Mellor	63e835cbcc	Fix progress bar and allow HTTPS in `benchmark_serving.py` (#2552)	2024-01-22 14:40:31 -08:00
Harry Mellor	2709c0009a	Support OpenAI API server in `benchmark_serving.py` (#2172)	2024-01-18 20:34:08 -08:00
Woosuk Kwon	37ca558103	Optimize model execution with CUDA graph (#1926) Co-authored-by: Chen Shen <scv119@gmail.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2023-12-16 21:12:08 -08:00
CHU Tianxiang	0fbfc4b81b	Add GPTQ support (#916)	2023-12-15 03:04:22 -08:00
Woosuk Kwon	5dd80d3777	Fix latency benchmark script (#2035)	2023-12-11 11:19:08 -08:00
wbn	dacaf5a400	Replace head_mapping params with num_kv_heads to attention kernel. (#1997) Co-authored-by: wangguoya <wangguoya@baidu.com> Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>	2023-12-10 10:12:53 -08:00
Antoni Baum	05ff90b692	Save pytorch profiler output for latency benchmark (#1871) * Save profiler output * Apply feedback from code review	2023-12-05 20:55:55 -08:00
aisensiy	8d8c2f6ffe	Support max-model-len argument for throughput benchmark (#1858)	2023-11-30 08:10:24 -08:00
Woosuk Kwon	51d3cb951d	Remove max_num_seqs in latency benchmark script (#1855)	2023-11-30 00:00:32 -08:00
Woosuk Kwon	e74b1736a1	Add profile option to latency benchmark script (#1839)	2023-11-29 23:42:52 -08:00
Yanming W	e0c6f556e8	[Build] Avoid building too many extensions (#1624)	2023-11-23 16:31:19 -08:00
Simon Mo	5ffc0d13a2	Migrate linter from `pylint` to `ruff` (#1665)	2023-11-20 11:58:01 -08:00
Zhuofan	dcc543a298	[Minor] Fix comment (#1704)	2023-11-17 09:42:49 -08:00
Woosuk Kwon	660a7fcfa4	Add DeepSpeed MII backend to benchmark script (#1649)	2023-11-14 12:35:30 -08:00
chooper1	1f24755bf8	Support SqueezeLLM (#1326) Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2023-10-21 23:14:59 -07:00
Woosuk Kwon	928de46888	Implement PagedAttention V2 (#1348)	2023-10-16 00:59:57 -07:00
Antoni Baum	acbed3ef40	Use monotonic time where appropriate (#1249)	2023-10-02 19:22:05 -07:00
kg6-sleipnir	b5a10eb0ef	Added `dtype` arg to benchmarks (#1228)	2023-09-30 21:04:03 -07:00
Woosuk Kwon	e3e79e9e8a	Implement AWQ quantization support for LLaMA (#1032) Co-authored-by: Robert Irvine <robert@seamlessml.com> Co-authored-by: root <rirv938@gmail.com> Co-authored-by: Casper <casperbh.96@gmail.com> Co-authored-by: julian-q <julianhquevedo@gmail.com>	2023-09-16 00:03:37 -07:00
Ricardo Lu	8c4b2592fb	fix: enable trust-remote-code in api server & benchmark. (#509)	2023-07-19 17:06:15 -07:00
WRH	cf21a9bd5c	support trust_remote_code in benchmark (#518)	2023-07-19 17:02:40 -07:00
Woosuk Kwon	4338cc4750	[Tokenizer] Add an option to specify tokenizer (#284)	2023-06-28 09:46:58 -07:00
Zhuohan Li	43710e8d09	[Fix] Fix default port number in benchmark scripts (#265)	2023-06-26 13:15:35 -07:00
Zhuohan Li	0370afa2e5	Remove benchmark_async_llm_server.py (#155)	2023-06-19 11:12:37 +08:00
Woosuk Kwon	3f92038b99	Add comments on swap space (#154)	2023-06-18 11:39:35 -07:00
Woosuk Kwon	0b98ba15c7	Change the name to vLLM (#150)	2023-06-17 03:07:40 -07:00

57 Commits