Lucas Wilkinson
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
Kevin H. Luu
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-10-01 03:33:12 +00:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
sasha0552
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add IQ1_M
quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
bnellnm
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2024-09-11 12:52:19 -07:00
Dipika Sikka
23f322297f
[Misc] Remove SqueezeLLM
( #8220 )
2024-09-06 16:29:03 -06:00
Lucas Wilkinson
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 ( #7757 )
2024-08-22 08:28:52 -04:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...]
functions with cub::BlockReduce
( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
bnellnm
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
Lucas Wilkinson
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin
kernels ( #7323 )
2024-08-12 14:40:13 -04:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Tyler Michael Smith
6e4852ce28
[CI/Build] Suppress divide-by-zero and missing return statement warnings ( #7001 )
2024-08-05 16:00:01 -04:00
Tyler Michael Smith
8571ac4672
[Kernel] Update CUTLASS to 3.5.1 ( #7085 )
2024-08-05 15:13:43 -04:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
Varun Sundar Rabindranath
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) ( #6996 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace ( #6950 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:22 -07:00
HandH1998
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
Tyler Michael Smith
cbbc904470
[Kernel] Squash a few more warnings ( #6914 )
2024-07-30 13:50:42 -04:00
Varun Sundar Rabindranath
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Tyler Michael Smith
61a97c32f6
[Kernel] Fix marlin divide-by-zero warnings ( #6904 )
2024-07-30 01:26:07 +00:00
Tyler Michael Smith
aae6d36f7e
[Kernel] Remove unused variables in awq/gemm_kernels.cu ( #6908 )
2024-07-29 18:01:17 -06:00
Tyler Michael Smith
60d1c6e584
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel ( #6901 )
2024-07-29 09:59:02 -07:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
Lucas Wilkinson
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
Tyler Michael Smith
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
Tyler Michael Smith
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Tyler Michael Smith
9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace ( #6384 )
2024-07-14 13:37:19 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Tyler Michael Smith
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm
( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Varun Sundar Rabindranath
6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels ( #5744 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-23 21:07:11 +00:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath
a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 ( #5275 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 13:33:21 +00:00
Varun Sundar Rabindranath
111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 ( #5514 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 06:37:08 +00:00
Tyler Michael Smith
b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression ( #5642 )
2024-06-18 23:48:49 +00:00
Tyler Michael Smith
348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later ( #5401 )
2024-06-14 10:02:00 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146 , this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).
In details, we applied 3 optimizations:
- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use vectorized 4 to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00