Chen Zhang
69d765f5a5
[V1] Move more control of kv cache initialization from model_executor to EngineCore ( #11960 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-17 07:39:35 +00:00
Roger Wang
70755e819e
[V1][Core] Autotune encoder cache budget ( #11895 )
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-15 11:29:00 -08:00
Chen Zhang
cf5f000d21
[torch.compile] Hide KV cache behind torch.compile boundary ( #11677 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-10 13:14:42 +08:00
Roger Wang
91b361ae89
[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision ( #11685 )
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 19:58:16 +00:00
Woosuk Kwon
06bfb51963
[V1] Add BlockTable class ( #11693 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-06 14:24:42 +09:00
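The BlockTable in the commit above maps each request's logical KV-cache blocks to physical block ids. A minimal sketch of the idea, with illustrative names and shapes rather than vLLM's actual class:

```python
import numpy as np

class BlockTable:
    """Illustrative fixed-size table of physical KV-cache block ids, one row
    per request slot; -1 marks an unassigned block."""

    def __init__(self, max_num_reqs: int, max_blocks_per_req: int):
        self.table = np.full((max_num_reqs, max_blocks_per_req), -1, dtype=np.int32)
        self.num_blocks = np.zeros(max_num_reqs, dtype=np.int32)

    def append_blocks(self, req_index: int, block_ids: list) -> None:
        # Append newly allocated physical blocks after the request's last block.
        start = self.num_blocks[req_index]
        self.table[req_index, start:start + len(block_ids)] = block_ids
        self.num_blocks[req_index] += len(block_ids)

    def get_blocks(self, req_index: int) -> np.ndarray:
        # Only the filled prefix of the row is meaningful.
        return self.table[req_index, :self.num_blocks[req_index]]
```

Keeping the table as one preallocated 2-D array (rather than per-request Python lists) is what makes it cheap to hand to the attention kernels each step.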
Yan Burman
300acb8347
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture ( #11233 )
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com>
Signed-off-by: Ido Asraff <idoa@atero.ai>
2025-01-04 14:50:16 +08:00
Woosuk Kwon
b55ed6ef8a
[V1][Minor] Optimize token_ids_cpu copy ( #11692 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-02 12:04:58 -07:00
Woosuk Kwon
73001445fb
[V1] Implement Cascade Attention ( #11635 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-01 21:56:46 +09:00
Roger Wang
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. ( #11632 )
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-31 21:17:22 +00:00
sroy745
dcb1a944d4
[V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler ( #10681 )
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-26 19:02:58 +09:00
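The penalties added to the sampler above follow well-known formulations; a hedged sketch of one common combination (HF-style repetition penalty plus OpenAI-style presence/frequency penalties), not necessarily the exact kernel in the PR:

```python
import numpy as np

def apply_penalties(logits: np.ndarray, output_token_counts: np.ndarray,
                    presence_penalty: float, frequency_penalty: float,
                    repetition_penalty: float) -> np.ndarray:
    """Illustrative per-request penalty application over one vocab-sized
    logit row, given per-token counts of previously generated tokens."""
    logits = logits.copy()
    seen = output_token_counts > 0
    # Repetition penalty scales down logits of already-generated tokens
    # (divide positive logits, multiply negative ones).
    logits[seen] = np.where(logits[seen] > 0,
                            logits[seen] / repetition_penalty,
                            logits[seen] * repetition_penalty)
    # Frequency penalty grows with occurrence count; presence penalty is flat.
    logits -= frequency_penalty * output_token_counts
    logits -= presence_penalty * seen.astype(logits.dtype)
    return logits
```

The min-tokens part of the commit is simpler still: mask the EOS logit to `-inf` until the request has produced its minimum number of tokens.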
Roger Wang
04139ade59
[V1] Fix profiling for models with merged input processor ( #11370 )
Signed-off-by: ywang96 <ywang@roblox.com>
2024-12-20 12:04:21 +00:00
Roger Wang
7379b3d4b2
[V1] Fix multimodal profiling for Molmo ( #11325 )
Signed-off-by: ywang96 <ywang@example.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-19 16:27:22 +00:00
Alexander Matveev
fdea8ec167
[V1] VLM - enable processor cache by default ( #11305 )
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
2024-12-18 18:54:46 -05:00
Roger Wang
59c9b6ebeb
[V1][VLM] Proper memory profiling for image language models ( #11210 )
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-16 22:10:57 -08:00
Woosuk Kwon
25ebed2f8c
[V1][Minor] Cache np arange to reduce input preparation overhead ( #11214 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-15 13:33:00 -08:00
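The optimization named in the title above — caching one arange instead of rebuilding it every scheduling step — can be sketched as follows (class and method names are illustrative):

```python
import numpy as np

class InputPreparer:
    """Illustrative: build a single arange at startup and slice it per step,
    instead of calling np.arange on every scheduling step."""

    def __init__(self, max_num_tokens: int):
        self.arange = np.arange(max_num_tokens, dtype=np.int64)  # built once

    def positions_for(self, num_tokens: int) -> np.ndarray:
        # Basic slicing returns a view: no allocation on the hot path.
        return self.arange[:num_tokens]
```

The saving per step is tiny, but input preparation runs once per forward pass, so shaving constant-size allocations off it adds up at high request rates.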
Mark McLoughlin
6d917d0eeb
Enable mypy checking on V1 code ( #11105 )
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2024-12-14 09:54:04 -08:00
youkaichao
be39e3cd18
[core] clean up cudagraph batchsize padding logic ( #10996 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-13 06:57:50 +00:00
Woosuk Kwon
f092153fbe
[V1] Use more persistent buffers to optimize input preparation overheads ( #11111 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-11 23:14:20 -08:00
Woosuk Kwon
d643c2aba1
[V1] Use input_ids as input for text-only models ( #11032 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-11 10:49:23 -08:00
Mor Zusman
ffa48c9146
[Model] PP support for Mamba-like models ( #10992 )
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-12-10 21:53:37 -05:00
youkaichao
75f89dc44c
[torch.compile] add a flag to track batchsize statistics ( #11059 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-10 12:40:52 -08:00
Tyler Michael Smith
28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 ( #9856 )
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-10 06:28:14 +00:00
youkaichao
1a2f8fb828
[v1] fix use compile sizes ( #11000 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 13:47:24 -08:00
Varun Sundar Rabindranath
25b79d9fd3
[V1] Input Batch Relocation ( #10962 )
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-09 09:33:41 -08:00
Woosuk Kwon
2a56e1264f
[V1] Fix when max_model_len is not divisible by block_size ( #10903 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-04 16:54:05 -08:00
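The edge case in the title above is the classic ceiling-division one: when `max_model_len` is not a multiple of `block_size`, the partial trailing block still needs a whole physical block. A minimal sketch of the arithmetic involved (an assumption about the shape of the fix, not the patch itself):

```python
def max_blocks_per_request(max_model_len: int, block_size: int) -> int:
    """Ceiling division: round the block count up, never down, so a
    partially filled last block is still allocated."""
    return (max_model_len + block_size - 1) // block_size
```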
youkaichao
dc5ce861bf
[torch.compile] remove compilation_context and simplify code ( #10838 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 06:19:02 +00:00
Roger Wang
2f0a0a17a4
[V1] Refactor model executable interface for multimodal models ( #10570 )
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-26 20:46:11 +00:00
Sage Moore
9a88f89799
custom allreduce + torch.compile ( #10121 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-25 22:00:16 -08:00
youkaichao
eebad39f26
[torch.compile] support all attention backends ( #10558 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
Woosuk Kwon
f9310cbd0c
[V1] Fix Compilation config & Enable CUDA graph by default ( #10528 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 12:53:39 -08:00
Luka Govedič
8b0fe06c89
[torch.compile] Inductor code caching fix ( #10273 )
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
2024-11-20 21:44:57 -08:00
youkaichao
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:09:03 -08:00
youkaichao
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 18:02:14 -08:00
Cyrus Leung
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-13 12:39:03 +00:00
Woosuk Kwon
bbd3e86926
[V1] Support VLMs with fine-grained scheduling ( #9871 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-11-13 04:53:13 +00:00
Woosuk Kwon
1f55e05713
[V1] Enable Inductor when using piecewise CUDA graphs ( #10268 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-12 13:39:56 -08:00
Woosuk Kwon
9d5b4e4dea
[V1] Enable custom ops with piecewise CUDA graphs ( #10228 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:58:07 -08:00
Woosuk Kwon
fe15729a2b
[V1] Use custom ops for piecewise CUDA graphs ( #10227 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:26:48 -08:00
Woosuk Kwon
d7a4f2207b
[V1] Do not use inductor for piecewise CUDA graphs ( #10225 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:05:57 -08:00
Woosuk Kwon
b5815c8413
[V1] Fix non-cudagraph op name ( #10166 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-08 10:23:04 -08:00
Nick Hill
1fa020c539
[V1][BugFix] Fix Generator construction in greedy + seed case ( #10097 )
Signed-off-by: Nick Hill <nhill@redhat.com>
2024-11-07 05:06:57 +00:00
Joe Runde
d58268c56a
[V1] Make v1 more testable ( #9888 )
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
Woosuk Kwon
4089985552
[V1] Integrate Piecewise CUDA graphs ( #10058 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-05 22:16:04 -08:00
Nick Hill
1f1b6d6eda
[V1] Support per-request seed ( #9945 )
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
2024-11-03 09:14:17 -08:00
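Per-request seeding means each request carries its own RNG, so a seeded request reproduces its samples regardless of what else shares the batch. The V1 sampler keeps per-request `torch.Generator` objects (see the Generator bugfix commit above); this self-contained sketch uses NumPy generators to show the same idea:

```python
import numpy as np

class PerRequestSampler:
    """Illustrative: one RNG stream per request id."""

    def __init__(self):
        self.generators = {}

    def add_request(self, req_id: str, seed=None) -> None:
        # Seeded requests get a deterministic stream; unseeded ones get
        # fresh OS entropy.
        self.generators[req_id] = np.random.default_rng(seed)

    def sample(self, req_id: str, probs: np.ndarray) -> int:
        # Draw a token index from this request's own stream.
        return int(self.generators[req_id].choice(len(probs), p=probs))
```

Two requests with the same seed draw identical token sequences even when interleaved with other traffic, which is what makes seeded generation testable.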
youkaichao
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 12:08:49 -07:00
youkaichao
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-02 07:35:05 -07:00
Woosuk Kwon
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00