Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
SnowDist
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as the one in the OpenAI API tests. I have moved the class into the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would otherwise duplicate the code a third time.)
Also, I have moved the test utilities file (test_utils.py) into the tests directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported as a relative module (a minimal sketch of the resulting layout follows this entry).
2024-05-13 23:50:09 +09:00
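As an illustration of the tests/utils.py move described above, here is a minimal Python sketch of the resulting layout. The file names under tests/entrypoints/ and the runtime_env argument to ray.init() are assumptions for illustration; only tests/utils.py, the __init__.py additions, and ServerRunner come from the commit message.

    # Assumed layout after the move (illustrative, not the exact vLLM test tree):
    #   tests/__init__.py
    #   tests/utils.py                        <- formerly test_utils.py; defines ServerRunner
    #   tests/entrypoints/__init__.py
    #   tests/entrypoints/test_openai_server.py

    # tests/entrypoints/test_openai_server.py
    # The __init__.py files make the tests tree a package, so the shared
    # utility class can be imported relative to it:
    from ..utils import ServerRunner

    # tests/utils.py -- assumed shape of the updated ray.init() call:
    import os
    import ray

    # Repo root, shipped to Ray workers so the tests package stays importable there.
    VLLM_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))

    ray.init(runtime_env={"working_dir": VLLM_PATH})

With the __init__.py files in place, pytest imports each test module under its package name (e.g. tests.entrypoints.test_openai_server), which is what makes the relative import above resolve.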
Cody Yu
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
Kunshang Ji
e9da5a40c6
[Misc] Add indirection layer for custom ops ( #3913 )
2024-04-10 20:26:07 -07:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Hongxia Yang
56f738ae9b
[ROCm] Fix some kernels failed unit tests ( #2498 )
2024-02-05 14:25:36 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Simon Mo
6e01e8c1c8
[CI] Add Buildkite ( #2355 )
2024-01-14 12:37:58 -08:00
Jee Li
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels ( #1959 )
2024-01-02 19:09:59 -08:00
wbn
dacaf5a400
Replace head_mapping params with num_kv_heads in the attention kernel. ( #1997 )
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions ( #1624 )
2023-11-23 16:31:19 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
Woosuk Kwon
0ce8647dc5
Fix integer overflows in attention & cache ops ( #1514 )
2023-10-31 15:19:30 -07:00
Woosuk Kwon
928de46888
Implement PagedAttention V2 ( #1348 )
2023-10-16 00:59:57 -07:00
Woosuk Kwon
6f88f762bf
Fix OOM in attention kernel test ( #1223 )
2023-09-28 14:33:24 -07:00
Antoni Baum
cf5cb1e33e
Allocate more shared memory to attention kernel ( #1154 )
2023-09-26 22:27:13 -07:00
Zhuohan Li
db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel ( #945 )
* [FIX] Fix Alibi implementation in PagedAttention kernel
* Fix test_attention
* Fix
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Woosuk Kwon
fbd80ad409
Clean up kernel unit tests ( #938 )
2023-09-05 16:57:38 -07:00
Aman Gupta Karmani
75471386de
use flash-attn via xformers ( #877 )
2023-08-29 21:52:13 -07:00
Tao Peng
d7a1c6d614
Fix paged attention testing. ( #495 )
Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>
2023-07-24 21:01:56 -07:00
Song
bda41c70dd
hotfix attn alibi w/o head mapping ( #496 )
Co-authored-by: oliveryuan <oliveryuan@basemind.com>
2023-07-18 11:31:48 -07:00
Andre Slavescu
c894836108
[Model] Add support for GPT-J ( #226 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon
e41f06702c
Add support for BLOOM ( #331 )
2023-07-03 13:12:35 -07:00
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter ( #326 )
2023-07-03 11:31:55 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00
Woosuk Kwon
e38074b1e6
Support FP32 ( #141 )
2023-06-07 00:40:21 -07:00
Woosuk Kwon
825d8892b5
Use pytest format for unit tests ( #107 )
2023-05-17 17:11:23 -07:00