20231088/vllm - vllm - Luminance Code Repo

Author	SHA1	Message	Date
Roger Wang	3ea2dc2ec4	[Misc] Remove deprecated arg for cuda graph capture (#9864 ) Signed-off-by: Roger Wang <ywang@roblox.com>	2024-10-31 07:22:07 +00:00
wangshuai09	4e2d95e372	[Hardware][ROCM] using current_platform.is_rocm (#9642 ) Signed-off-by: wangshuai09 <391746016@qq.com>	2024-10-28 04:07:00 +00:00
youkaichao	8549c82660	[core] cudagraph output with tensor weak reference (#9724 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-10-27 00:19:28 -07:00
yudian0504	8ca8954841	[Bugfix][Misc]: fix graph capture for decoder (#9549 )	2024-10-21 17:33:30 +00:00
Thomas Parnell	496e991da8	[Doc] Consistent naming of attention backends (#9498 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-10-21 22:29:57 +08:00
Chen Zhang	4fa3e33349	[Kernel] Support sliding window in flash attention backend (#9403 )	2024-10-20 10:57:52 -07:00
Kuntai Du	81ede99ca4	[Core] Deprecating block manager v1 and make block manager v2 default (#8704 ) Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).	2024-10-17 11:38:15 -05:00
Lucas Wilkinson	9d30a056e7	[misc] CUDA Time Layerwise Profiler (#8337 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-10-17 10:36:09 -04:00
Cyrus Leung	7e7eae338d	[Misc] Standardize RoPE handling for Qwen2-VL (#9250 )	2024-10-16 13:56:17 +08:00
Tyler Michael Smith	16b24e7dcd	[Bugfix] Bandaid fix for speculative decoding tests (#9327 )	2024-10-13 23:02:11 +00:00
Tyler Michael Smith	7342a7d7f8	[Model] Support Mamba (#6484 )	2024-10-11 15:40:06 +00:00
youkaichao	e4d652ea3e	[torch.compile] integration with compilation control (#9058 )	2024-10-10 12:39:36 -07:00
Alex Brooks	a3691b6b5e	[Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131 ) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>	2024-10-08 14:12:56 +00:00
Cyrus Leung	0e36fd4909	[Misc] Move registry to its own file (#9064 )	2024-10-04 10:01:37 +00:00
youkaichao	9aaf14c62e	[misc] add forward context for attention (#9029 )	2024-10-03 12:09:42 -07:00
Varun Sundar Rabindranath	afb050b29d	[Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-10-02 19:44:39 +00:00
Lily Liu	1570203864	[Spec Decode] (1/2) Remove batch expansion (#8839 )	2024-10-01 16:04:42 -07:00
youkaichao	7da2487591	[torch.compile] fix tensor alias (#8982 )	2024-10-01 03:40:48 +00:00
Jee Jee Li	1cabfcefb6	[Misc] Adjust max_position_embeddings for LoRA compatibility (#8957 )	2024-09-30 12:57:39 +00:00
Jee Jee Li	3d49776bbb	[Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199 )	2024-09-29 06:59:45 +00:00
Varun Sundar Rabindranath	19d02ff938	[Bugfix] Fix PP for Multi-Step (#8887 )	2024-09-28 08:52:46 -07:00
youkaichao	a9b15c606f	[torch.compile] use empty tensor instead of None for profiling (#8875 )	2024-09-27 08:11:32 -07:00
Huazhong Ji	ca2b628b3c	[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (#8703 )	2024-09-22 10:44:09 -07:00
sroy745	1009e93c5d	[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631 )	2024-09-17 07:35:01 -07:00
youkaichao	a36e070dad	[torch.compile] fix functionalization (#8480 )	2024-09-14 09:46:04 -07:00
youkaichao	0a4806f0a9	[plugin][torch.compile] allow to add custom compile backend (#8445 )	2024-09-13 09:32:42 -07:00
Cody Yu	a65cb16067	[MISC] Dump model runner inputs when crashing (#8305 )	2024-09-12 01:12:25 +00:00
bnellnm	73202dbe77	[Kernel][Misc] register ops to prevent graph breaks (#6917 ) Co-authored-by: Sage Moore <sage@neuralmagic.com>	2024-09-11 12:52:19 -07:00
Yang Fan	3b7fea770f	[Model][VLM] Add Qwen2-VL model support (#7905 ) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-09-11 09:31:19 -07:00
Alexander Matveev	6d646d08a2	[Core] Optimize Async + Multi-step (#8050 )	2024-09-03 18:50:29 +00:00
afeldman-nm	428dd1445e	[Core] Logprobs support in Multi-step (#7652 )	2024-08-29 19:19:08 -07:00
kushanam	c334b1898b	extend cuda graph size for H200 (#7894 ) Co-authored-by: youkaichao <youkaichao@126.com>	2024-08-29 12:15:04 -07:00
Alexander Matveev	3f60f2244e	[Core] Combine async postprocessor and multi-step (#7921 )	2024-08-29 11:18:26 -07:00
youkaichao	a7f65c2be9	[torch.compile] remove reset (#7975 )	2024-08-28 17:32:26 -07:00
Cody Yu	e3580537a4	[Performance] Enable chunked prefill and prefix caching together (#7753 )	2024-08-28 00:36:31 -07:00
Alexander Matveev	f508e03e7f	[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911 )	2024-08-28 00:02:30 -07:00
bnellnm	c166e7e43e	[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886 )	2024-08-27 23:13:45 -04:00
youkaichao	64cc644425	[core][torch.compile] discard the compile for profiling (#7796 )	2024-08-26 21:33:58 -07:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049 ) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
Abhinav Goyal	a3fce56b88	[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830 )	2024-08-22 02:42:24 -07:00
Antoni Baum	3b682179dd	[Core] Add `AttentionState` abstraction (#7663 )	2024-08-20 18:50:45 +00:00
Roger Wang	bbf55c4805	[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530 )	2024-08-17 13:30:55 -07:00
Mahesh Keralapura	93478b63d2	[Core] Fix tracking of model forward time in case of PP>1 (#7440 ) [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)	2024-08-16 13:46:01 -07:00
Cyrus Leung	3f674a49b5	[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126 )	2024-08-14 17:55:42 +00:00
Peter Salas	00c3d68e45	[Frontend][Core] Add plumbing to support audio language models (#7446 )	2024-08-13 17:39:33 +00:00
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089 )	2024-08-09 13:55:13 -07:00
Alexander Matveev	74af2bbd90	[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360 )	2024-08-09 16:35:49 +00:00
Mor Zusman	07ab160741	[Model][Jamba] Mamba cache single buffer (#6739 ) Co-authored-by: Mor Zusman <morz@ai21.com>	2024-08-09 10:07:06 -04:00
Alexander Matveev	e02ac55617	[Performance] Optimize e2e overheads: Reduce python allocations (#7162 )	2024-08-08 21:34:28 -07:00
Cody Yu	ef527be06c	[MISC] Use non-blocking transfer in prepare_input (#7172 )	2024-08-05 23:41:27 +00:00

204 Commits