4826 Commits

Author SHA1 Message Date
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
a31614e386
[ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-02-27 10:39:10 +08:00
Lucas Wilkinson
f95903909f
[Kernel] FlashMLA integration (#13747)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-02-27 10:35:08 +08:00
Woosuk Kwon
b382a7f28f
[BugFix] Make FP8 Linear compatible with torch.compile (#13918)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-26 13:48:55 -08:00
Wallas Henrique
4cb6fa0a9c
[Bugfix] Backend option to disable xgrammar any_whitespace (#12744)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-26 10:52:34 -08:00
Chauncey
d08b285adf
[Misc] fixed qwen_vl_utils parameter error (#13906) 2025-02-26 08:31:53 -08:00
Chenyaaang
b27122acc2
[TPU] use torch2.6 with whl package (#13860)
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
2025-02-26 08:18:54 -05:00
Cyrus Leung
934bb99c71
[Bugfix] Update expected token counts for Ultravox tests (#13895) 2025-02-26 04:56:50 -08:00
Joe Runde
3f808cc044
[Bugfix] Do not crash V0 engine on input errors (#13101)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-26 19:07:29 +08:00
Brayden Zhong
ec8a5e5386
[Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor (#13736)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-26 19:06:47 +08:00
Florian Greinacher
215bf150a6
[Bugfix] Handle None parameters in Mistral function calls. (#13786) 2025-02-26 03:06:21 -08:00
Harry Mellor
0ecdd98031
Add comments on accessing kv_cache and attn_metadata (#13887)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-26 18:41:02 +08:00
Cyrus Leung
7b700ec8c8
[Bugfix] Add test example for Ultravox v0.5 (#13890) 2025-02-26 02:31:43 -08:00
Roger Wang
7ca1da020f
[Misc] Fix input processing for Ultravox (#13871) 2025-02-25 23:56:34 -08:00
Jee Jee Li
5157338ed9
[Misc] Improve LoRA spelling (#13831) 2025-02-25 23:43:01 -08:00
Seth Kimmel
e206b54331
[v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
Signed-off-by: Seth Kimmel <seth.kimmel3@gmail.com>
2025-02-26 14:58:24 +08:00
Sage Moore
1d35662e6d
[ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms (#13844)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-26 14:56:58 +08:00
Albert
e656f638de
[Doc] fix the incorrect module path of tensorize_vllm_model (#13863) 2025-02-25 22:56:19 -08:00
Harry Mellor
145944cb94
Improve pipeline partitioning (#13839) 2025-02-25 18:53:56 -08:00
Henry Tsang
094b7d9496
[Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues (#13797) 2025-02-25 18:52:03 -08:00
Chenguang Li
e1fe7591f2
[Misc]Code Cleanup (#13859)
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2025-02-26 10:44:30 +08:00
Lily Liu
5629f26df7
[V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729) 2025-02-25 18:14:48 -08:00
Rui Qiao
9ba28043b5
[misc] Show driver IP info when Ray fails to allocate driver worker (#13858)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-26 09:53:43 +08:00
Harry Mellor
24679788ed
DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-26 01:24:57 +00:00
Michael Goin
07c4353057
[Model] Support Grok1 (#13795)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-26 01:07:12 +00:00
Harry Mellor
34e3494e70
Fix failing MyGemma2Embedding test (#13820)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-25 12:33:03 -08:00
Liangfu Chen
f75aa72732
[Neuron] Add custom_ops for neuron backend (#13246)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: George Novack <gnovack@amazon.com>
Co-authored-by: Aoyu Zhang <aoyuzhan@amazon.com>
2025-02-25 11:47:49 -08:00
Chen1022
340e39e387
Fix string parsing error (#13825) 2025-02-25 08:20:29 -08:00
Cyrus Leung
f4133ce4e5
[Bugfix] Revert inspection code in #13743 (#13832)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-26 00:18:50 +08:00
Wen Sun
6522d55b6f
Fix /v1/audio/transcriptions Bad Request Error (#13811) 2025-02-25 06:03:33 -08:00
Isotr0py
6ff518626c
[Bugfix] Fix deepseek-vl2 inference with more than 2 images (#13818) 2025-02-25 06:03:02 -08:00
Nichols A. Romero
fa82074167
[Bugfix] Flush TunableOp results before worker processes are destroyed. (#13623)
Signed-off-by: Nichols A. Romero <nick.romero@amd.com>
2025-02-25 11:08:20 +00:00
Junlin Zhou
75e9d49796
[Bugfix] Initialize attention bias on the same device as Query/Key/Value (#13468) 2025-02-25 02:13:09 -08:00
Chen1022
32c3b6bfd1
[Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs (#13724)
Signed-off-by: Chen-0210 <chenjincong11@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-02-25 10:12:19 +00:00
Jee Jee Li
37b6cb4985
[CI/Build] Fix V1 LoRA failure (#13767) 2025-02-25 02:01:15 -08:00
Gregory Shtrasberg
aabeb2688f
[ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) 2025-02-25 00:39:59 -08:00
Jiayi Yao
2f42a4888c
[Feature] Support KV cache offloading and disagg prefill with LMCache connector. (#12953) 2025-02-25 00:38:42 -08:00
Rui Qiao
3173c3b34e
[misc] Clean up ray compiled graph type hints (#13731) 2025-02-25 00:37:08 -08:00
Shanshan Shen
2d87d7d1ac
[Bugfix] Modify modelscope api usage in transformer_utils (#13807) 2025-02-25 00:36:07 -08:00
Russell Bryant
aab392774b
[Core] xgrammar: Expand list of unsupported jsonschema keywords (#13783)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-25 08:21:25 +00:00
Cyrus Leung
6724e79164
[Misc] Check that the model can be inspected upon registration (#13743) 2025-02-25 00:18:19 -08:00
Varun Sundar Rabindranath
03f48b3db6
[Core] LoRA V1 - Add add/pin/list/remove_lora functions (#13705) 2025-02-25 00:18:02 -08:00
Michael Goin
4d251ad00e
Fix CompressedTensorsWNA16MoE with grouped scales (#13769) 2025-02-25 00:17:14 -08:00
Michael Goin
18e505930d
[Bugfix] Support MLA for CompressedTensorsWNA16 (#13725)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-25 06:10:31 +00:00
Lucas Wilkinson
4a8cfc7551
[Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" (#13802) 2025-02-24 20:33:59 -08:00
Mark McLoughlin
bc32bc73aa
[V1][Metrics] Implement vllm:lora_requests_info metric (#13504) 2025-02-24 20:01:33 -08:00
wangxiyuan
ab1091d5f2
[Misc][Attention][Quantization] init property earlier (#13733)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-25 03:19:30 +00:00
Tyler Michael Smith
1e15aaef56
[Bugfix][Quantization] Fix FP8 + EP (#13784)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-25 10:54:17 +08:00
cjackal
51010a1807
[Misc] set single whitespace between log sentences (#13771)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2025-02-25 10:26:12 +08:00
Eli Boyarski
7196a3b1db
[Doc] arg_utils.py: fixed a typo (#13785) 2025-02-24 18:23:04 -08:00
Harry Mellor
cdc1fa12eb
Remove unused kwargs from model definitions (#13555) 2025-02-24 17:13:52 -08:00