4889 Commits

Author SHA1 Message Date
Cody Yu
f35f8e2242
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-03 16:43:14 +08:00
Mengqing Cao
b87c21fc89
[Misc][Platform] Move use allgather to platform (#14010)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-03-03 15:40:04 +08:00
wang.yuqi
e584b85afd
[Misc] duplicate code in deepseek_v2 (#14106)
2025-03-03 14:10:11 +08:00
Sheng Yao
09e56f9262
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051)
2025-03-02 17:35:01 -08:00
Harry Mellor
cf069aa8aa
Update deprecated Python 3.8 typing (#13971)
2025-03-02 17:34:51 -08:00
Ce Gao
bf33700ecd
[v0][structured output] Support reasoning output (#12955)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-03-02 14:49:42 -05:00
qux-bbb
bc6ccb9878
[Doc] Source building add clone step (#14086)
Signed-off-by: qux-bbb <1147635419@qq.com>
2025-03-02 10:59:50 +00:00
Jun Duan
82fbeae92b
[Misc] Accurately capture the time of loading weights (#14063)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-03-01 17:20:30 -08:00
Jee Jee Li
cc5e8f6db8
[Model] Add LoRA support for TransformersModel (#13770)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-02 09:17:34 +08:00
Chen Zhang
d54990da47
[v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081)
2025-03-01 20:46:02 +00:00
Chen Zhang
b9f1d4294e
[v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073)
2025-03-01 08:25:54 +00:00
Sage Moore
b28246f6ff
[ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-01 07:18:32 +00:00
Woosuk Kwon
3b5567a209
[V1][Minor] Do not print attn backend twice (#13985)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-01 07:09:14 +00:00
Isotr0py
fdcc405346
[Doc] Consolidate whisper and florence2 examples (#14050)
2025-02-28 22:49:15 -08:00
Kuntai Du
8994dabc22
[Documentation] Add more deployment guide for Kubernetes deployment (#13841)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-03-01 06:44:24 +00:00
Li, Jiang
02296f420d
[Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053)
2025-02-28 22:31:01 -08:00
YajieWang
6a92ff93e1
[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931)
2025-02-28 22:30:59 -08:00
Jee Jee Li
6a84164add
[Bugfix] Add file lock for ModelScope download (#14060)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-01 06:10:28 +00:00
Brayden Zhong
f64ffa8c25
[Docs] Add pipeline_parallel_size to optimization docs (#14059)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-01 05:43:54 +00:00
Luka Govedič
bd56c983d6
[torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902)
Signed-off-by: luka <luka@neuralmagic.com>
2025-02-28 16:20:11 -07:00
Rui Qiao
084bbac8cc
[core] Bump ray to 2.43 (#13994)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-28 21:47:44 +00:00
Chen Zhang
28943d36ce
[v1] Move block pool operations to a separate class (#13973)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-28 20:53:31 +00:00
Andrey Talman
b526ca6726
Add RELEASE.md (#13926)
Signed-off-by: atalman <atalman@fb.com>
2025-02-28 12:25:50 -08:00
Chen Zhang
e7bd944e08
[v1] Cleanup the BlockTable in InputBatch (#13977)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-02-28 19:03:16 +00:00
iefgnoix
c3b6559a10
[V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-02-28 11:01:36 -07:00
Harry Mellor
4be4b26cb7
Fix entrypoint tests for embedding models (#14052)
2025-02-28 08:56:44 -08:00
Brayden Zhong
2aed2c9fa7
[Doc] Fix ROCm documentation (#14041)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-28 16:42:07 +00:00
Yang Liu
9b61dd41e7
[Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series (#14031)
2025-02-28 07:36:08 -08:00
Cyrus Leung
f7bee5c815
[VLM][Bugfix] Enable specifying prompt target via index (#14038)
2025-02-28 07:35:55 -08:00
Jee Jee Li
e0734387fb
[Bugfix] Fix MoeWNA16Method activation (#14024)
2025-02-28 15:22:42 +00:00
Harry Mellor
f58f8b5c96
Update AutoAWQ docs (#14042)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-28 15:20:29 +00:00
Thibault Schueller
b3f7aaccd0
[V1][Minor] Restore V1 compatibility with LLMEngine class (#13090)
2025-02-28 00:52:25 -08:00
Kacper Pietkun
b91660ddb8
[Hardware][Intel-Gaudi] Regional compilation support (#13213)
2025-02-28 00:51:49 -08:00
Harry Mellor
76c89fcadd
Use smaller embedding model when not testing model specifically (#13891)
2025-02-28 00:50:43 -08:00
Mathis Felardos
b9e41734c5
[Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) (#13987)
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
2025-02-28 07:53:45 +00:00
Cyrus Leung
1088f06242
[Doc] Move multimodal Embedding API example to Online Serving page (#14017)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-28 07:12:04 +00:00
Travis Johnson
73e0225ee9
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#13911)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-02-28 04:00:45 +00:00
Roger Wang
6c85da3a18
[V1]SupportsV0Only protocol for model definitions (#13959)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-27 20:02:15 -05:00
Jee Jee Li
67fc426845
[Misc] Print FusedMoE detail info (#13974)
2025-02-27 18:53:13 -05:00
Benjamin Chislett
9804145cac
[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict (#13626)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-02-27 15:28:08 -08:00
Lucas Wilkinson
2e94b9cfbb
[Attention] Flash MLA for V1 (#13867)
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
2025-02-27 23:03:41 +00:00
qli88
8294773e48
[core] Perf improvement for DSv3 on AMD GPUs (#13718)
Signed-off-by: qli88 <qiang.li2@amd.com>
2025-02-27 22:14:30 +00:00
Woosuk Kwon
cd813c6d4d
[V1][Minor] Minor cleanup for GPU Model Runner (#13983)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-27 13:11:40 -08:00
Sage Moore
38acae6e97
[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-27 20:31:47 +00:00
Cyrus Leung
a2dd48c386
[VLM] Deprecate legacy input mapper for OOT multimodal models (#13979)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-27 19:14:55 +00:00
dependabot[bot]
126f6beeb4
Bump azure/setup-helm from 4.2.0 to 4.3.0 (#13742)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 19:04:10 +00:00
Yang Chen
58d1b2aa77
[Attention] MLA support for V1 (#13789)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-27 13:14:17 -05:00
Cyrus Leung
f1579b229d
[VLM] Generalized prompt updates for multi-modal processor (#13964)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-27 17:44:25 +00:00
Isotr0py
7864875879
[Bugfix] Fix qwen2.5-vl overflow issue (#13968)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-27 17:30:39 +00:00
Noam Gat
1dd422b64a
Update LMFE version to v0.10.11 to support new versions of transforme… (#13930)
2025-02-27 17:16:12 +00:00