20231088/vllm

Author	SHA1	Message	Date
Aarni Koskela	8baa454937	[Misc] Move device options to a single place (#8322 )	2024-09-11 13:25:58 -07:00
Wei-Sheng Chin	795b662cff	Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241 )	2024-09-06 20:18:16 -07:00
afeldman-nm	e5cab71531	[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191 )	2024-09-06 09:01:14 -07:00
Cody Yu	77d9e514a2	[MISC] Replace input token throughput with total token throughput (#8164 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-09-04 20:23:22 +00:00
Nick Hill	d4db9f53c8	[Benchmark] Add `--async-engine` option to benchmark_throughput.py (#7964 )	2024-09-03 20:57:41 -04:00
Wei-Sheng Chin	0c785d344d	Add more percentiles and latencies (#7759 )	2024-08-29 16:48:11 -07:00
Philipp Schmid	345be0e244	[benchmark] Update TGI version (#7917 )	2024-08-27 15:07:53 -07:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049 ) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
Alexander Matveev	9db93de20c	[Core] Add multi-step support to LLMEngine (#7789 )	2024-08-23 12:45:53 -07:00
Jiaxin Shan	d3b5b98021	[Misc] Enhance prefix-caching benchmark tool (#6568 )	2024-08-22 09:32:02 -07:00
Luka Govedič	7937009a7e	[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-21 20:18:00 -04:00
William Lin	dd53c4b023	[misc] Add Torch profiler support (#7451 ) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-21 15:39:26 -07:00
Lucas Wilkinson	5288c06aa0	[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174 )	2024-08-20 07:09:33 -06:00
Mor Zusman	7fc23be81c	[Kernel] W8A16 Int8 inside FusedMoE (#7415 )	2024-08-16 10:06:51 -07:00
Roger Wang	70d268a399	[Bugfix] Fix ITL recording in serving benchmark (#7372 )	2024-08-09 10:00:00 -07:00
Luka Govedič	8d59dbb000	[Kernel] Add per-tensor and per-token AZP epilogues (#5941 ) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-08-06 18:17:08 +00:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396 )	2024-08-02 13:51:58 -07:00
Varun Sundar Rabindranath	35e9c12bfa	[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath	766435e660	[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-29 09:42:35 -06:00
Alexander Matveev	75acdaa4b6	[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795 )	2024-07-27 17:52:33 -04:00
Joe	14dbd5a767	[Model] H2O Danube3-4b (#6451 )	2024-07-26 20:47:50 -07:00
Cyrus Leung	739b61a348	[Frontend] Refactor prompt processing (#4028 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-22 10:13:53 -07:00
Woosuk Kwon	a9a2e74d21	[Misc] Use `torch.Tensor` for type annotation (#6505 )	2024-07-17 13:01:10 +00:00
Michael Goin	978aed5300	[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081 )	2024-07-16 15:31:32 -07:00
Fish	ccb20db8bd	[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428 )	2024-07-14 19:27:01 -07:00
Ethan Xu	dbfe254eda	[Feature] vLLM CLI (#5090 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-07-14 15:36:43 -07:00
Kuntai Du	a4feba929b	[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362 )	2024-07-11 13:28:38 -07:00
Robert Shaw	b675069d74	[ Misc ] Refactor Marlin Python Utilities (#6082 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-07-11 15:40:11 +00:00
Roger Wang	c4774eb841	[Bugfix] Fix snapshot download in serving benchmark (#6318 )	2024-07-11 07:04:05 +00:00
Haichuan	717f4bcea0	Feature/add benchmark testing (#5947 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-08 07:52:06 +00:00
Haichuan	333306a252	add benchmark for fix length input and output (#5857 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-07 07:42:13 +00:00
Alexander Matveev	3476ed0809	[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602 )	2024-07-01 20:10:37 -07:00
James Whedbee	e373853e12	[Frontend] Relax api url assertion for openai benchmarking (#6046 )	2024-07-01 23:39:10 +00:00
zhyncs	bb60326836	[Misc] update benchmark backend for scalellm (#6018 )	2024-07-01 10:20:33 -07:00
mcalman	c4bca740e8	[Bugfix] fix missing last itl in openai completions benchmark (#5926 )	2024-06-29 10:34:42 +08:00
Ilya Lavrenov	57f09a419c	[Hardware][Intel] OpenVINO vLLM backend (#5379 )	2024-06-28 13:50:16 +00:00
Woo-Yeon Lee	2ce5d6688b	[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414 )	2024-06-25 09:56:06 +00:00
Michael Goin	8065a7e220	[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718 )	2024-06-20 17:00:13 -06:00
DearPlanet	d8714530d1	[Misc]Add param max-model-len in benchmark_latency.py (#5629 )	2024-06-19 18:19:08 +08:00
Tyler Michael Smith	6820724e51	[Bugfix] Fix w8a8 benchmarks for int8 case (#5643 )	2024-06-19 00:33:25 +00:00
Ronen Schaffer	7879f24dcc	[Misc] Add OpenTelemetry support (#4687 ) This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. It also adds a Markdown guide with screenshots showing how to use this feature.	2024-06-19 01:17:03 +09:00
Kuntai Du	9e4e6fe207	[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571 )	2024-06-17 11:41:08 -07:00
Kunshang Ji	728c4c8a06	[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814 ) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com> Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>	2024-06-17 11:01:25 -07:00
zhyncs	1f12122b17	[Misc] use AutoTokenizer for benchmark serving when vLLM not installed (#5588 )	2024-06-17 09:40:35 -07:00
Cody Yu	e2b85cf86a	Fix w8a8 benchmark and add Llama-3-8B (#5562 )	2024-06-17 06:48:06 +00:00
Cyrus Leung	0e9164b40a	[mypy] Enable type checking for test directory (#5017 )	2024-06-15 04:45:31 +00:00
Allen.Dou	d74674bbd9	[Misc] Fix arg names (#5524 )	2024-06-14 09:47:44 -07:00
Kuntai Du	319ad7f1d3	[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (#5073 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-06-13 22:36:20 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Wang, Yi	88407532e7	[Bugfix]if the content is started with ":"(response of ping), client should i… (#5303 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 20:16:41 -07:00

145 Commits