89 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Jee Li | 566b57c5c4 | [Kernel] support non-zero cuda devices in punica kernels (#3636) | 2024-03-27 |
| Jee Li | 8af890a865 | Enable more models to inference based on LoRA (#3382). Co-authored-by: Antoni Baum | 2024-03-25 |
| Hanzhi Zhou | f721096d48 | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 |
| Woosuk Kwon | 9101d832e6 | [Bugfix] Make moe_align_block_size AMD-compatible (#3470) | 2024-03-18 |
| Simon Mo | 8e67598aa6 | [Misc] fix line length for entire codebase (#3444) | 2024-03-16 |
| akhoroshev | 78b6c4845a | Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) | 2024-03-14 |
| Terry | 7e9bd08f60 | Add batched RoPE kernel (#3095) | 2024-03-13 |
| Or Sharir | ae0ccb4017 | Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) | 2024-03-13 |
| Woosuk Kwon | 602358f8a8 | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 |
| kliuae | c9415c19d3 | [ROCm] Fix warp and lane calculation in blockReduceSum (#3321) | 2024-03-11 |
| Douglas Lehr | e4a28e5316 | [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) | 2024-03-10 |
| Terry | 0bba88df03 | Enhance lora tests with more layer and rank variations (#3243) | 2024-03-09 |
| whyiug | c59e120c55 | Feature add lora support for Qwen2 (#3177) | 2024-03-07 |
| Robert Shaw | c0c2335ce0 | Integrate Marlin Kernels for Int4 GPTQ inference (#2497). Co-authored-by: Robert Shaw, alexm | 2024-03-01 |
| CHU Tianxiang | 01a5d18a53 | Add Support for 2/3/8-bit GPTQ Quantization Models (#2330) | 2024-02-28 |
| Woosuk Kwon | 929b4f2973 | Add LoRA support for Gemma (#3050) | 2024-02-28 |
| Woosuk Kwon | d6e4a130b0 | [Minor] Remove gather_cached_kv kernel (#3043) | 2024-02-26 |
| Woosuk Kwon | fd5dcc5c81 | Optimize GeGLU layer in Gemma (#2975) | 2024-02-21 |
| Rex | 563836496a | Refactor 2 awq gemm kernels into m16nXk32 (#2723). Co-authored-by: Chunan Zeng | 2024-02-12 |
| Woosuk Kwon | f0d4e14557 | Add fused top-K softmax kernel for MoE (#2769) | 2024-02-05 |
| zhaoyang-star | 923797fea4 | Fix compile error when using rocm (#2648) | 2024-02-01 |
| Philipp Moritz | ab40644669 | Fused MOE for Mixtral (#2542). Co-authored-by: chen shen | 2024-01-29 |
| wangding zeng | 5d60def02c | DeepseekMoE support with Fused MoE kernel (#2453). Co-authored-by: roy | 2024-01-29 |
| Hanzhi Zhou | 1b20639a43 | No repeated IPC open (#2642) | 2024-01-29 |
| zhaoyang-star | 9090bf02e7 | Support FP8-E5M2 KV Cache (#2279). Co-authored-by: zhaoyang, Zhuohan Li | 2024-01-28 |
| Woosuk Kwon | f8ecb84c02 | Speed up Punica compilation (#2632) | 2024-01-27 |
| Hanzhi Zhou | 380170038e | Implement custom all reduce kernels (#2192) | 2024-01-27 |
| Casper | beb89f68b4 | AWQ: Up to 2.66x higher throughput (#2566) | 2024-01-26 |
| Hongxia Yang | 6b7de1a030 | [ROCm] add support to ROCm 6.0 and MI300 (#2274) | 2024-01-26 |
| Vladimir | 5265631d15 | use a correct device when creating OptionalCUDAGuard (#2583) | 2024-01-25 |
| Antoni Baum | 9b945daaf1 | [Experimental] Add multi-LoRA support (#1804). Co-authored-by: Chen Shen, Shreyas Krishnaswamy, Avnish Narayan | 2024-01-23 |
| Woosuk Kwon | 6ef00b03a2 | Enable CUDA graph for GPTQ & SqueezeLLM (#2318) | 2024-01-03 |
| Jee Li | 77af974b40 | [FIX] Support non-zero CUDA devices in custom kernels (#1959) | 2024-01-02 |
| kliuae | 1b7c791d60 | [ROCm] Fixes for GPTQ on ROCm (#2180) | 2023-12-18 |
| Woosuk Kwon | 76a7983b23 | [BugFix] Fix RoPE kernel on long sequences (#2164) | 2023-12-17 |
| CHU Tianxiang | 0fbfc4b81b | Add GPTQ support (#916) | 2023-12-15 |
| Mingcan Xiang | 614856da25 | Avoid multiple redefinition (#1817) | 2023-12-14 |
| wbn | dacaf5a400 | Replace head_mapping params with num_kv_heads to attention kernel. (#1997). Co-authored-by: wangguoya, Yang Zhao | 2023-12-10 |
| TJian | 6ccc0bfffb | Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836). Co-authored-by: Philipp Moritz, Amir Balwel, root, tjtanaa, kuanfu, miloice | 2023-12-07 |
| Yanming W | e0c6f556e8 | [Build] Avoid building too many extensions (#1624) | 2023-11-23 |
| ljss | e1054247ba | [Optimization] Implement fused add rmsnorm (#1667) | 2023-11-18 |
| Antoni Baum | 9f669a9a7c | Support YaRN models (#1264). Signed-off-by: Antoni Baum. Co-authored-by: Viktor Ferenczi, Woosuk Kwon | 2023-11-03 |
| Woosuk Kwon | 0ce8647dc5 | Fix integer overflows in attention & cache ops (#1514) | 2023-10-31 |
| chooper1 | 1f24755bf8 | Support SqueezeLLM (#1326). Co-authored-by: squeeze-ai-lab, Woosuk Kwon | 2023-10-21 |
| Woosuk Kwon | c1376e0f82 | Change scheduler & input tensor shape (#1381) | 2023-10-16 |
| Woosuk Kwon | 928de46888 | Implement PagedAttention V2 (#1348) | 2023-10-16 |
| Woosuk Kwon | 29678cd213 | Minor fix on AWQ kernel launch (#1356) | 2023-10-15 |
| CHU Tianxiang | 980dd4a2c4 | Fix overflow in awq kernel (#1295). Co-authored-by: 楚天翔 | 2023-10-11 |
| twaka | 8285736840 | workaround of AWQ for Turing GPUs (#1252) | 2023-10-10 |
| Liang | ebe4d1db3a | Fix boundary check in paged attention kernel (#1241) | 2023-10-01 |