270 Commits

Author SHA1 Message Date
aniaan
3963a5335b
[Misc] refactor(config): clean up unused code (#6320) 2024-07-11 09:39:07 +00:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
youkaichao
3de6e6a30e
[core][distributed] support n layers % pp size != 0 (#6115) 2024-07-03 16:40:31 -07:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
youkaichao
3c6325f0fc
[core][distributed] custom allreduce when pp size > 1 (#6117) 2024-07-03 14:41:32 -07:00
Nick Hill
d830656a97
[BugFix] Avoid unnecessary Ray import warnings (#6079) 2024-07-03 14:09:40 +08:00
Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs (#5276)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00
Mor Zusman
9d6a8daa87
[Model] Jamba support (#4115)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 23:11:29 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support (#4412)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Sirej Dua
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050)
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
xwjiang2010
98d6682cd1
[VLM] Remove image_input_type from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) 2024-07-01 00:33:05 -07:00
wangding zeng
be0b3af9e0
Support Deepseek-V2 (#4650)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-28 13:24:57 -07:00
Thomas Parnell
ec1ad0046c
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (#5894)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-28 07:42:17 -07:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Cyrus Leung
5cbe8d155c
[Core] Registry for processing model inputs (#5214)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Woosuk Kwon
79c92c7c8a
[Model] Add Gemma 2 (#5908) 2024-06-27 13:33:56 -07:00
Nick Hill
691e29ecf3
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens (#5876) 2024-06-27 10:59:33 -07:00
Matt Wong
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) 2024-06-25 15:56:15 -07:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) 2024-06-25 09:56:06 +00:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown with print-screens to guide users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Bruce Fontaine
a3e8a05d4c
[Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) 2024-06-17 15:26:41 -07:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Amit Garg
9333fb8eb9
[Model] Rename Phi3 rope scaling type (#5595) 2024-06-17 12:04:14 -04:00
Nick Hill
1c0afa13c5
[BugFix] Don't start a Ray cluster when not using Ray (#5570) 2024-06-15 16:30:51 -07:00
SangBin Cho
e691918e3b
[misc] Do not allow to use lora with chunked prefill. (#5538)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-06-15 14:59:36 +00:00
Antoni Baum
50eed24d25
Add cuda_device_count_stateless (#5473) 2024-06-13 16:06:49 -07:00
Cody Yu
30299a41fa
[MISC] Remove FP8 warning (#5472)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-13 11:22:30 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case (#5230)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
sasha0552
dcbf4286af
[Frontend] Customizable RoPE theta (#5197) 2024-06-11 10:42:26 -07:00
maor-ps
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config (#5350)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
Roger Wang
7a9cb294ae
[Frontend] Add OpenAI Vision API Support (#5237)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-07 11:23:32 -07:00
liuyhwangyh
4efff036f0
Bugfix: fix broken of download models from modelscope (#5233)
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
2024-06-06 09:28:10 -07:00
Cyrus Leung
89c920785f
[CI/Build] Update vision tests (#5307) 2024-06-06 05:17:18 -05:00
Nick Hill
3d33e372a1
[BugFix] Fix log message about default max model length (#5284) 2024-06-05 14:53:16 -07:00
Kaiyang Chen
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) 2024-06-03 13:37:11 -07:00
Cyrus Leung
7a64d24aad
[Core] Support image processor (#4197) 2024-06-02 22:56:41 -07:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) 2024-06-01 14:51:10 -06:00
Robert Shaw
d910816c73
[Bugfix] Automatically Detect SparseML models (#5119) 2024-05-30 12:58:37 +00:00
Zhuohan Li
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-27 15:18:17 -07:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
sasha0552
9b9a10d6cb
[Frontend] Dynamic RoPE scaling (#4638) 2024-05-22 01:32:35 -04:00
SangBin Cho
2e9a2227ec
[Lora] Support long context lora (#4787)
Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through.

It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors.

Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
Alexei-V-Ivanov-AMD
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797) 2024-05-16 20:58:25 -07:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support (#4790)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader (#4690)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
zifeitong
a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816) 2024-05-15 07:22:09 -07:00