20231088/vllm - vllm - Luminance Code Repo

Author	SHA1	Message	Date
Kuntai Du	81ede99ca4	[Core] Deprecating block manager v1 and make block manager v2 default (#8704) Removes block manager v1. This is the first piece of the prefix-caching-centric design: to achieve it, the code path must be simplified so that only the v2 block manager is used (which has much higher performance on prefix caching).	2024-10-17 11:38:15 -05:00
Cyrus Leung	7e7eae338d	[Misc] Standardize RoPE handling for Qwen2-VL (#9250)	2024-10-16 13:56:17 +08:00
Grace Ho	5d264f4ab8	pass ignore_eos parameter to all benchmark_serving calls (#9349)	2024-10-15 13:30:44 -07:00
Andy Dai	94bf9ae4e9	[Misc] Fix sampling from sonnet for long context case (#9235)	2024-10-11 00:33:16 +00:00
sroy745	f3a507f1d3	[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149)	2024-10-10 14:17:17 +08:00
youkaichao	18b296fdb2	[core] remove beam search from the core (#9105)	2024-10-07 05:47:04 +00:00
Brendan Wong	168cab6bbf	[Frontend] API support for beam search (#9087) Co-authored-by: youkaichao <youkaichao@126.com>	2024-10-05 23:39:03 -07:00
Cody Yu	27302dd584	[Misc] Fix CI lint (#9085)	2024-10-04 16:07:54 -07:00
Andy Dai	0cc566ca8f	[Misc] Add random seed for prefix cache benchmark (#9081)	2024-10-04 21:58:57 +00:00
Kuntai Du	fbb74420e7	[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412)	2024-10-04 14:01:44 -07:00
vlsav	22f5851b80	Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997)	2024-10-01 11:07:06 -07:00
Chen Zhang	e585b583a9	[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891)	2024-09-28 18:51:22 +00:00
Peter Pan	0e088750af	[MISC] Fix invalid escape sequence '\' (#8830) Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>	2024-09-27 01:13:25 -07:00
Cyrus Leung	3b00b9c26c	[Core] rename `PromptInputs` and `inputs` (#8876)	2024-09-26 20:35:15 -07:00
Simon Mo	4f1ba0844b	Revert "rename PromptInputs and inputs with backward compatibility (#8760)" (#8810)	2024-09-25 10:36:26 -07:00
Cyrus Leung	28e1299e60	rename PromptInputs and inputs with backward compatibility (#8760)	2024-09-25 09:36:47 -07:00
Archit Patke	6da1ab6b41	[Core] Adding Priority Scheduling (#5958)	2024-09-24 19:50:50 -07:00
Simon Mo	3185fb0cca	Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (#8750)	2024-09-24 05:45:20 +00:00
youkaichao	0250dd68c5	re-implement beam search on top of vllm core (#8726) Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>	2024-09-23 22:08:12 -07:00
Lucas Wilkinson	86e9c8df29	[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701) Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-09-23 13:46:26 -04:00
Cyrus Leung	0057894ef7	[Core] Rename `PromptInputs` and `inputs` (#8673)	2024-09-20 19:00:54 -07:00
Kunshang Ji	855c8ae2c9	[MISC] remove engine_use_ray in benchmark_throughput.py (#8615)	2024-09-18 22:33:20 -07:00
Kuntai Du	c52ec5f034	[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616)	2024-09-19 05:24:24 +00:00
Aaron Pham	9d104b5beb	[CI/Build] Update Ruff version (#8469) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-09-18 11:00:56 +00:00
Cyrus Leung	6ffa3f314c	[CI/Build] Avoid CUDA initialization (#8534)	2024-09-18 10:38:11 +00:00
Isotr0py	1b6de8352b	[Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495)	2024-09-17 07:34:27 +00:00
Aarni Koskela	8baa454937	[Misc] Move device options to a single place (#8322)	2024-09-11 13:25:58 -07:00
Wei-Sheng Chin	795b662cff	Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241)	2024-09-06 20:18:16 -07:00
afeldman-nm	e5cab71531	[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)	2024-09-06 09:01:14 -07:00
Cody Yu	77d9e514a2	[MISC] Replace input token throughput with total token throughput (#8164) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-09-04 20:23:22 +00:00
Nick Hill	d4db9f53c8	[Benchmark] Add `--async-engine` option to benchmark_throughput.py (#7964)	2024-09-03 20:57:41 -04:00
Wei-Sheng Chin	0c785d344d	Add more percentiles and latencies (#7759)	2024-08-29 16:48:11 -07:00
Philipp Schmid	345be0e244	[benchmark] Update TGI version (#7917)	2024-08-27 15:07:53 -07:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
Alexander Matveev	9db93de20c	[Core] Add multi-step support to LLMEngine (#7789)	2024-08-23 12:45:53 -07:00
Jiaxin Shan	d3b5b98021	[Misc] Enhance prefix-caching benchmark tool (#6568)	2024-08-22 09:32:02 -07:00
Luka Govedič	7937009a7e	[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-21 20:18:00 -04:00
William Lin	dd53c4b023	[misc] Add Torch profiler support (#7451) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-21 15:39:26 -07:00
Lucas Wilkinson	5288c06aa0	[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)	2024-08-20 07:09:33 -06:00
Mor Zusman	7fc23be81c	[Kernel] W8A16 Int8 inside FusedMoE (#7415)	2024-08-16 10:06:51 -07:00
Roger Wang	70d268a399	[Bugfix] Fix ITL recording in serving benchmark (#7372)	2024-08-09 10:00:00 -07:00
Luka Govedič	8d59dbb000	[Kernel] Add per-tensor and per-token AZP epilogues (#5941) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-08-06 18:17:08 +00:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396)	2024-08-02 13:51:58 -07:00
Varun Sundar Rabindranath	35e9c12bfa	[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath	766435e660	[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-29 09:42:35 -06:00
Alexander Matveev	75acdaa4b6	[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795)	2024-07-27 17:52:33 -04:00
Joe	14dbd5a767	[Model] H2O Danube3-4b (#6451)	2024-07-26 20:47:50 -07:00
Cyrus Leung	739b61a348	[Frontend] Refactor prompt processing (#4028) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-22 10:13:53 -07:00
Woosuk Kwon	a9a2e74d21	[Misc] Use `torch.Tensor` for type annotation (#6505)	2024-07-17 13:01:10 +00:00
Michael Goin	978aed5300	[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081)	2024-07-16 15:31:32 -07:00

171 Commits