Kuntai Du
ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
Zhuohan Li
c579b750a0
[Doc] Add meetups to the doc ( #4798 )
2024-05-13 18:48:00 -07:00
Cyrus Leung
4bfa7e7f75
[Doc] Add API reference for offline inference ( #4710 )
2024-05-13 17:47:42 -07:00
Zhuohan Li
ac1fbf7fd2
[Doc] Shorten README by removing supported model list ( #4796 )
2024-05-13 16:23:54 -07:00
Philipp Moritz
33d3914b1e
[Bugfix] Fix dynamic FP8 quantization for Mixtral ( #4793 )
2024-05-13 19:00:27 -04:00
Stephen Krider
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cody Yu
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
Sanger Steel
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer
to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
SangBin Cho
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py
to tests/utils.py
( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
youkaichao
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
Swapnil Parekh
a7be4d0072
[CORE] Improvement in ranks code ( #4718 )
2024-05-12 17:47:47 -07:00
Robert Shaw
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
Yikang Shen
6eaccb7353
[Model] Add support for IBM Granite Code models ( #4636 )
2024-05-11 21:27:24 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
youkaichao
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
Robert Shaw
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
heeju-kim2
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-10 17:36:25 +00:00
Allen.Dou
706588a77d
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4729 )
2024-05-11 00:00:56 +09:00
SangBin Cho
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing exception frame is extremely prone to circular refernece because it contains the reference to objects.
When tensorizer is not installed, it leaks llm instance because error frame has references to various modules which cause circular reference problem.
I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
Steve Grubb
dac6a3f6ed
[Misc] Apply a couple g++ cleanups ( #4719 )
2024-05-10 13:37:05 +00:00
Kunshang Ji
64b77dfd7e
[Core]fix type annotation for swap_blocks
( #4726 )
2024-05-10 21:52:48 +09:00
Simon Mo
51d4094fda
chunked-prefill-doc-syntax ( #4603 )
...
Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html
Co-authored-by: sang <rkooo567@gmail.com>
2024-05-10 14:13:23 +09:00
Allen.Dou
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
youkaichao
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
Cody Yu
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
Philipp Moritz
379da6dcb5
[Kernel] [FP8] Improve FP8 linear layer performance ( #4691 )
...
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).
We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
2024-05-09 16:38:07 -07:00
Hao Zhang
ebce310b74
[Model] Snowflake arctic model implementation ( #4652 )
...
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-09 22:37:14 +00:00
Michael Goin
be0c5180ac
[Bugfix] Add logs for all model dtype casting ( #4717 )
2024-05-09 18:36:25 +00:00
Robert Shaw
cea64430f6
[Bugfix] Update grafana.json ( #4711 )
2024-05-09 10:10:13 -07:00
Cyrus Leung
a3c124570a
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4709 )
2024-05-09 09:53:14 -07:00
kliuae
ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs ( #3140 )
...
Co-authored-by: miloice <jeffaw99@hotmail.com>
2024-05-09 09:19:50 -07:00
Woosuk Kwon
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
Woosuk Kwon
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
Cyrus Leung
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
Mahmoud Ashraf
16bc0a098f
[Frontend] add tok/s speed metric to llm class when using tqdm ( #4400 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-08 22:02:31 -07:00
alexm-nm
e288df0632
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin ( #4626 )
2024-05-08 17:14:31 -07:00
Cade Daniel
8b9241be3a
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs ( #4672 )
2024-05-08 23:24:46 +00:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-08 21:44:00 +00:00
Woosuk Kwon
89579a201f
[Misc] Use vllm-flash-attn instead of flash-attn ( #4686 )
2024-05-08 13:15:34 -07:00
youkaichao
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
Antoni Baum
ad932a221d
[Core] Faster startup for LoRA enabled models ( #4634 )
2024-05-08 10:33:18 -07:00
Woosuk Kwon
5510cf0e8a
[Misc] Add get_name
method to attention backends ( #4685 )
2024-05-08 09:59:31 -07:00
DefTruth
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
SangBin Cho
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
SangBin Cho
d7740ea4dc
[Core] Optimize sampler get_logprobs ( #4594 )
2024-05-08 08:42:28 -07:00
youkaichao
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
leiwen83
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-07 11:40:18 -07:00