4504 Commits

Author SHA1 Message Date
youkaichao
59fff4a01a
[core] improve error handling when wake up from sleep mode (#12981)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-10 09:38:57 +08:00
Lu Fang
29f1d47e73
[MISC] Always import version library first in the vllm package (#12979)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-09 18:56:40 +08:00
youkaichao
cf797aa856
[core] port pynvml into vllm codebase (#12963)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-09 15:00:00 +08:00
Woosuk Kwon
24700c346b
[V1] Cache uses_mrope in GPUModelRunner (#12969) 2025-02-08 15:32:32 -08:00
Patrick von Platen
d366ccc4e3
[RFC] [Mistral] FP8 format (#10130)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-02-08 14:12:53 -07:00
Woosuk Kwon
870c37481e
[V1][Minor] Remove outdated comment (#12968)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-08 12:48:30 -08:00
Jee Jee Li
86222a3dab
[VLM] Merged multi-modal processor for GLM4V (#12449)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-08 20:32:16 +00:00
youkaichao
fe743b798d
[bugfix] fix early import of flash attention (#12959)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-09 00:06:56 +08:00
shangmingc
913df14da3
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU (#12935)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-08 14:46:19 +00:00
Cyrus Leung
8a69e0e20e
[CI/Build] Auto-fix Markdown files (#12941) 2025-02-08 04:25:15 -08:00
Isotr0py
4c8dd12ef3
[Misc] Add qwen2.5-vl BNB support (#12944) 2025-02-08 04:24:47 -08:00
Jun Duan
256a2d29dc
[Doc] Correct HF repository for TeleChat2 models (#12949) 2025-02-08 01:42:15 -08:00
Liangfu Chen
c45d398e6f
[CI] Resolve transformers-neuronx version conflict (#12925) 2025-02-08 01:41:35 -08:00
Jun Duan
011e612d92
[Misc] Log time consumption on weight downloading (#12926) 2025-02-08 09:16:42 +00:00
Varun Sundar Rabindranath
7e1837676a
[misc] Add LoRA to benchmark_serving (#12898)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-08 17:15:44 +08:00
Sanju C Sudhakaran
2880e21e3d
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi (#12812)
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
2025-02-08 17:15:30 +08:00
wangxiyuan
407b5537db
[Build] Make pypi install work on CPU platform (#12874) 2025-02-08 01:15:15 -08:00
Woosuk Kwon
4ea48fb35c
[V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-08 00:39:09 -08:00
Shaoting
e31498bdcb
[Misc] Add offline test for disaggregated prefill (#12418) 2025-02-08 08:38:20 +00:00
youkaichao
91dd8f7aa6
[bugfix] respect distributed_executor_backend in world_size=1 (#12934)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-08 16:17:08 +08:00
zifeitong
d01f66b039
[Bugfix] Fix multi-round chat error when mistral tokenizer is used (#12859)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-02-08 07:04:34 +00:00
Ke Zhao
cc01223f3b
[Misc] Fix typo in the example file (#12896)
Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com>
2025-02-08 06:56:43 +00:00
Jee Jee Li
306923da82
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping (#12905) 2025-02-07 21:02:53 -08:00
Woosuk Kwon
3243158336
[V1] Move KV block hashes from Request to KVCacheManager (#12922)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-07 19:14:10 -08:00
Woosuk Kwon
b21f0f9d17
[V1][Minor] Remove outdated comment (#12928)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-07 19:07:37 -08:00
Lu Fang
45cbc4991d
[Bugfix] Fix disagg hang caused by the prefill and decode communication issues (#12723)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-07 16:39:50 -08:00
Robert Shaw
932c6b7461
[V1] LM Eval With Streaming Integration Tests (#11590) 2025-02-07 15:07:03 -08:00
TJian
eaa92d4437
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501) 2025-02-07 08:13:43 -08:00
afeldman-nm
0630d4537a
[V1] Logprobs and prompt logprobs support (#9880)
This PR is adding support for sample logprobs & prompt logprobs to vLLM v1.

New behavior:

- During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.)
- Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer.

Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>


Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-02-07 07:26:20 -08:00
Amit Garg
538fab93cd
PR #12718 (#12718) 2025-02-07 06:22:37 -08:00
Cyrus Leung
ce26b16268
[Misc] Remove unnecessary detokenization in multimodal processing (#12868) 2025-02-07 06:21:17 -08:00
Lu Fang
1918aa1b80
[MISC][EASY] Break check file names into entry and args in the pre-commit hooks (#12880)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-07 13:04:39 +00:00
Maximilien de Bayser
6e1fc61f0f
Prevent unecessary requests to huggingface hub (#12837) 2025-02-06 21:37:41 -08:00
Szymon Ożóg
aa375dca9f
[Bugfix] Missing quant_config in deepseek embedding layer (#12836) 2025-02-06 21:35:09 -08:00
ZSL98
433c4a4923
Make vllm compatible with verl (#12824)
Co-authored-by: zhangshulai <zhangshulai@bytedance.com>
2025-02-07 11:54:20 +08:00
Lucas Wilkinson
ef533d25fb
[Bugfix] FA2 illegal memory access (#12848) 2025-02-06 19:54:07 -08:00
Kevin H. Luu
b260782357
[misc] Revert # 12833 (#12857)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-06 16:29:12 -08:00
Lu Fang
741429a4cd
[MISC] Check space in the file names in the pre commit checks (#12804)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-06 15:36:21 -08:00
Yu Chin Fabian Lim
aff404571b
Add Bamba Model (#10909)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-06 15:22:42 -08:00
Varun Sundar Rabindranath
467a96a541
[V1] LoRA Support (#10957)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-06 09:32:51 -08:00
Isotr0py
8108ac841d
[Bugfix] Fix unsupported FA version check for Turing GPU (#12828) 2025-02-06 09:18:22 -08:00
Jitse Klomp
afe74f7a96
[Doc] double quote cmake package in build.inc.md (#12840) 2025-02-06 09:17:55 -08:00
youkaichao
09b95e36ab
[torch.compile] PyTorch 2.6 and nightly compatibility (#12393)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-07 01:09:07 +08:00
Isotr0py
85ac82d228
[Kernel] Make rotary_embedding ops more flexible with input shape (#12777) 2025-02-06 08:46:13 -08:00
Cyrus Leung
1e57b1ee63
[Misc] Remove unnecessary decode call (#12833) 2025-02-06 08:45:44 -08:00
Kevin H. Luu
e152f29502
[misc] Reduce number of config file requests to HuggingFace (#12797)
Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-06 14:59:18 +00:00
Lucas Wilkinson
c786e757fa
[Attention] Use FA3 for MLA on Hopper (#12807)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-06 11:43:12 +00:00
Simon Mo
cefd56ee35
[Docs] Add Google Cloud Slides (#12814) 2025-02-06 01:02:38 -08:00
Dipika Sikka
7ca9934fe7
[Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) 2025-02-06 01:02:14 -08:00
youkaichao
0408efc6d0
[Misc] Improve error message for incorrect pynvml (#12809)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-06 15:23:50 +08:00