SangBin Cho
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported
to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported
to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
youkaichao
8f89d72090
[Doc] add common case for long waiting time ( #5430 )
2024-06-11 11:12:13 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
youkaichao
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
sasha0552
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
Ali Panahi
00e6a2dc53
[Bugfix] fix lora_dtype value type in arg_utils.py ( #5398 )
2024-06-11 10:40:23 -07:00
Junichi Sato
2e02311a1b
[Bugfix] Fix MultiprocessingGPUExecutor.check_health
when world_size == 1 ( #5254 )
2024-06-11 10:38:07 -07:00
Cade Daniel
89ec06c33b
[Docs] [Spec decode] Fix docs error in code example ( #5427 )
2024-06-11 10:31:56 -07:00
Kuntai Du
9fde251bf0
[Doc] Add an automatic prefix caching section in vllm documentation ( #5324 )
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-06-11 10:24:59 -07:00
Cade Daniel
4c2ffb28ff
[Speculative decoding] Initial spec decode docs ( #5400 )
2024-06-11 10:15:40 -07:00
SangBin Cho
246598a6b1
[CI] docfix ( #5410 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-11 01:28:50 -07:00
Woosuk Kwon
8bab4959be
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable ( #5389 )
2024-06-11 00:37:56 -07:00
Roger Wang
3c4cebf751
[Doc][Typo] Fixing Missing Comma ( #5403 )
2024-06-11 00:20:28 -07:00
youkaichao
d8f31f2f8b
[Doc] add debugging tips ( #5409 )
2024-06-10 23:21:43 -07:00
Cyrus Leung
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
maor-ps
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00
Nick Hill
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
Kevin H. Luu
76477a93b7
[ci] Fix Buildkite agent path ( #5392 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-10 18:58:07 -07:00
Michael Goin
77c87beb06
[Doc] Add documentation for FP8 W8A8 ( #5388 )
2024-06-10 18:55:12 -06:00
Simon Mo
114332b88e
Bump version to v0.5.0 ( #5384 )
2024-06-10 15:56:06 -07:00
Woosuk Kwon
cb77ad836f
[Docs] Alphabetically sort sponsors ( #5386 )
2024-06-10 15:17:19 -05:00
Roger Wang
856c990041
[Docs] Add Docs on Limitations of VLM Support ( #5383 )
2024-06-10 09:53:50 -07:00
Kevin H. Luu
c5602f0baa
[ci] Mount buildkite agent on Docker container to upload benchmark results ( #5330 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-10 09:22:34 -07:00
Kevin H. Luu
f7f9c5f97b
[ci] Use small_cpu_queue for doc build ( #5331 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-10 09:21:11 -07:00
Cyrus Leung
2c0d933594
[Bugfix] Fix LLaVA-NeXT ( #5380 )
2024-06-10 15:38:47 +00:00
Itay Etelis
774d1035e4
[Feature][Frontend]: Continued stream_options
implementation also in CompletionRequest ( #5319 )
2024-06-10 14:22:09 +00:00
Cyrus Leung
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-10 12:47:15 +00:00
Cyrus Leung
0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails ( #5194 )
2024-06-10 19:38:49 +08:00
youkaichao
c81da5f56d
[misc][typo] fix typo ( #5372 )
2024-06-10 09:51:02 +00:00
Roger Wang
68bc81703e
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server ( #5374 )
2024-06-10 09:13:39 +00:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors
config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
Bla_ckB
45f92c00cf
[Bugfix] Fix KeyError: 1 When Using LoRA adapters ( #5164 )
2024-06-09 16:23:14 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
youkaichao
5d7e3d0176
[mis][ci/test] fix flaky test in test_sharded_state_loader.py ( #5361 )
...
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361 )
2024-06-09 03:50:14 +00:00
youkaichao
0373e1837e
[Core][CUDA Graph] add output buffer for cudagraph ( #5074 )
...
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074 )
2024-06-08 19:14:43 -07:00
Michael Goin
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
youkaichao
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357 )
2024-06-08 08:59:20 +00:00
youkaichao
9fb900f90c
[CI/Test] improve robustness of test (hf_runner) ( #5347 )
...
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347 )
2024-06-07 22:31:32 -07:00
Hongxia Yang
c96fc06747
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention ( #4965 )
2024-06-07 19:13:12 -07:00
Benjamin Kitor
b3376e5c76
[Misc] Add args for selecting distributed executor to benchmarks ( #5335 )
2024-06-08 09:20:16 +08:00
Cheng Li
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS ( #5352 )
...
Bug description:
With torch 2.4.0.dev20240603+cu121,
cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112)
This PR fixes the support check for FP8 CUTLASS ( cutlass_fp8_supported) which was introduced in https://github.com/vllm-project/vllm/pull/5183 .
2024-06-08 00:42:05 +00:00
Calvinn Ng
767c727a81
fix DbrxFusedNormAttention missing cache_config ( #5340 )
...
Co-authored-by: team <calvinn.ng@ahrefs.com>
2024-06-07 14:10:21 -07:00
Jie Fu (傅杰)
6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend ( #5345 )
2024-06-07 14:09:13 -07:00
Roger Wang
7a9cb294ae
[Frontend] Add OpenAI Vision API Support ( #5237 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-07 11:23:32 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
limingshu
dc49fb892c
Addition of lacked ignored_seq_groups in _schedule_chunked_prefill ( #5296 )
2024-06-07 13:35:42 +00:00
Antoni Baum
18a277b52d
Remove Ray health check ( #4693 )
2024-06-07 10:01:56 +00:00
Tyler Michael Smith
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels ( #5183 )
...
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8
see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
2024-06-07 08:42:35 +00:00