Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
HandH1998
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` ( #6081 )
2024-07-16 15:31:32 -07:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Chip Kerchner
38a1674abb
Support CPU inference with VSX PowerPC ISA ( #5652 )
2024-06-26 21:53:04 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Michael Goin
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` ( #4722 )
2024-05-22 07:18:41 +00:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Philipp Moritz
12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales ( #4343 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-27 04:49:59 +00:00
Philipp Moritz
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral ( #4244 )
...
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118 ), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
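To make the idea of dynamic per-tensor scaling concrete, here is a minimal pure-Python sketch. It is not the kernel from this PR: the helper names (`round_to_e4m3`, `quantize_per_tensor`, `dequantize`) are hypothetical, the e4m3fn format is an assumption, and the FP8 cast is only simulated by rounding to 3 mantissa bits. The point it illustrates is that the scale is derived from the tensor's own maximum absolute value at run time, so no calibration dataset is needed.

```python
import math

# Largest finite value of the e4m3fn format (assumed FP8 format for this sketch).
FP8_E4M3_MAX = 448.0


def round_to_e4m3(value: float) -> float:
    """Simulate rounding a float to the nearest e4m3fn-representable value."""
    if value == 0.0:
        return 0.0
    # Saturate to the representable range before rounding.
    clamped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    _, exponent = math.frexp(abs(clamped))
    # 3 mantissa bits plus the implicit bit give a step of 2**(exponent - 4);
    # below the smallest normal (2**-6) the step stays fixed at 2**-9.
    step = 2.0 ** (max(exponent, -5) - 4)
    return round(clamped / step) * step


def quantize_per_tensor(values):
    """Compute one scale for the whole tensor from its current contents."""
    amax = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor, which would give a zero scale.
    scale = max(amax / FP8_E4M3_MAX, 1e-12)
    quantized = [round_to_e4m3(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map simulated FP8 values back to the original range."""
    return [q * scale for q in quantized]


activations = [0.5, -2.0, 4.0, 1.0]
quantized, scale = quantize_per_tensor(activations)
restored = dequantize(quantized, scale)
print("scale:", scale)
print("quantized:", quantized)
print("restored:", restored)
```

Because the scale is recomputed from each tensor, the largest element always maps to the edge of the FP8 range. A static scheme would instead fix `scale` ahead of time from calibration data.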
**Performance**: For this PR, the focus is on keeping the code clean while still getting reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954 ) will be submitted in a follow-up PR. With this PR, the results are as follows:
<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">
**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
Happy hacking!
2024-04-24 01:18:23 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models ( #2330 )
2024-02-28 21:52:23 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Philipp Moritz
ab40644669
Fused MOE for Mixtral ( #2542 )
...
Co-authored-by: chen shen <scv119@gmail.com>
2024-01-29 22:43:37 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Casper
beb89f68b4
AWQ: Up to 2.66x higher throughput ( #2566 )
2024-01-26 23:53:17 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Mingcan Xiang
614856da25
Avoid multiple redefinition ( #1817 )
2023-12-14 09:35:58 -08:00
wbn
dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. ( #1997 )
...
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main ( #1836 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions ( #1624 )
2023-11-23 16:31:19 -08:00