978b45f399  2025-01-23 06:45:48 -08:00  Lucas Wilkinson
    [Kernel] Flash Attention 3 Support (#12093)
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

68ad4e3a8d  2025-01-22 14:39:32 +08:00  youkaichao
    [Core] Support fully transparent sleep mode (#11743)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

73001445fb  2025-01-01 21:56:46 +09:00  Woosuk Kwon
    [V1] Implement Cascade Attention (#11635)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

970d6d0776  2024-12-30 17:22:13 +08:00  Tyler Michael Smith
    [Build][Kernel] Update CUTLASS to v3.6.0 (#11607)
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

5a9da2e6e9  2024-12-19 02:43:30 +00:00  Tyler Michael Smith
    [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

60508ffda9  2024-12-18 09:57:16 -05:00  Dipika Sikka
    [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
    Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
    Co-authored-by: ilmarkov <markovilya197@gmail.com>
    Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
    Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

30870b4f66  2024-12-13 03:19:23 +00:00  Luka Govedič
    [torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
    Signed-off-by: luka <luka@neuralmagic.com>
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

073a4bd1c0  2024-12-01 17:55:39 -08:00  Woosuk Kwon
    [Kernel] Use out arg in flash_attn_varlen_func (#10811)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

8c1e77fb58  2024-11-28 08:31:28 -08:00  Woosuk Kwon
    [Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

9a8bff0285  2024-11-28 02:25:59 -08:00  Woosuk Kwon
    [Kernel] Update vllm-flash-attn version (#10736)
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

f5792c7c4a  2024-11-26 10:26:28 -08:00  Conroy Cheers
    [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
    Signed-off-by: Conroy Cheers <conroy@corncheese.org>

7c25fe45a6  2024-11-22 21:14:49 -08:00  kliuae
    [AMD] Add support for GGUF quantization on ROCm (#10254)

7629a9c6e5  2024-11-19 21:35:50 -08:00  wchen61
    [CI/Build] Support compilation with local cutlass path (#10423) (#10424)

812c981fa0  2024-11-11 22:55:07 -08:00  Aleksandr Malyshev
    Splitting attention kernel file (#10091)
    Signed-off-by: maleksan85 <maleksan@amd.com>
    Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>

4f93dfe952  2024-11-08 21:20:08 +00:00  Luka Govedič
    [torch.compile] Fuse RMSNorm with quant (#9138)
    Signed-off-by: luka <luka@neuralmagic.com>
    Co-authored-by: youkaichao <youkaichao@126.com>

098f94de42  2024-11-06 14:31:01 +00:00  Russell Bryant
    [CI/Build] Drop Python 3.8 support (#10038)
    Signed-off-by: Russell Bryant <rbryant@redhat.com>
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

21063c11c7  2024-11-06 07:11:55 +00:00  Aaron Pham
    [CI/Build] drop support for Python 3.8 EOL (#8464)
    Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

d93478b399  2024-11-04 15:11:28 -08:00  bnellnm
    [Bugfix] Upgrade to pytorch 2.5.1 (#10001)
    Signed-off-by: Bill Nell <bill@neuralmagic.com>

3cb07a36a2  2024-10-27 09:44:24 +00:00  bnellnm
    [Misc] Upgrade to pytorch 2.5 (#9588)
    Signed-off-by: Bill Nell <bill@neuralmagic.com>
    Signed-off-by: youkaichao <youkaichao@gmail.com>
    Co-authored-by: youkaichao <youkaichao@gmail.com>

59449095ab  2024-10-24 15:37:52 -07:00  Charlie Fu
    [Performance][Kernel] Fused_moe Performance Improvement (#9384)
    Signed-off-by: charlifu <charlifu@amd.com>

51c24c9736  2024-10-23 12:43:07 +08:00  Luka Govedič
    [Build] Fix FetchContent multiple build issue (#9596)
    Signed-off-by: luka <luka@neuralmagic.com>

d1e8240875  2024-10-22 15:41:13 -07:00  Lucas Wilkinson
    [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing (#9487)

eca2c5f7c0  2024-10-17 19:08:34 +00:00  bnellnm
    [Bugfix] Fix support for dimension like integers and ScalarType (#9299)

717a5f82cd  2024-10-16 00:15:21 +00:00  Lucas Wilkinson
    [Bugfix][CI/Build] Fix CUDA 11.8 Build (#9386)

de9fb4bef8  2024-10-11 15:57:39 -04:00  Lucas Wilkinson
    [Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected (#9254)

05d686432f  2024-10-04 12:34:44 -06:00  ElizaWszola
    [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
    Co-authored-by: Dipika <dipikasikka1@gmail.com>
    Co-authored-by: Dipika Sikka <ds3822@columbia.edu>

22482e495e  2024-10-04 09:43:15 -06:00  Lucas Wilkinson
    [Bugfix] Flash attention arches not getting set properly (#9062)

aeb37c2a72  2024-10-03 22:55:25 -04:00  Lucas Wilkinson
    [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845)

2e7fe7e79f  2024-09-29 03:13:01 +00:00  Tyler Michael Smith
    [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (#8930)

a928ded995  2024-09-24 09:31:42 -07:00  ElizaWszola
    [Kernel] Split Marlin MoE kernels into multiple files (#8661)
    Co-authored-by: mgoin <michael@neuralmagic.com>

86e9c8df29  2024-09-23 13:46:26 -04:00  Lucas Wilkinson
    [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
    Co-authored-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
    Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

3dda7c2250  2024-09-22 22:24:59 -04:00  Tyler Michael Smith
    [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (#8702)

0e40ac9b7b  2024-09-21 23:24:58 -07:00  youkaichao
    [ci][build] fix vllm-flash-attn (#8699)

71c60491f2  2024-09-20 23:27:10 -07:00  Luka Govedič
    [Kernel] Build flash-attn from source (#8245)

1ef0d2efd0  2024-09-13 17:01:11 -07:00  Charlie Fu
    [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)

94144e726c  2024-09-10 23:51:58 +00:00  Tyler Michael Smith
    [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043)

23f322297f  2024-09-06 16:29:03 -06:00  Dipika Sikka
    [Misc] Remove SqueezeLLM (#8220)

fdd9daafa3  2024-08-28 15:06:52 -07:00  Mor Zusman
    [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)

fc911880cc  2024-08-27 15:07:09 -07:00  Dipika Sikka
    [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>

55d63b1211  2024-08-22 08:28:52 -04:00  Lucas Wilkinson
    [Bugfix] Don't build machete on cuda <12.0 (#7757)

aae74ef95c  2024-08-22 03:42:14 +00:00  Michael Goin
    Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764)

8678a69ab5  2024-08-21 16:17:10 -07:00  Dipika Sikka
    [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>

1b32e02648  2024-08-21 11:17:48 -07:00  sasha0552
    [Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730)

5288c06aa0  2024-08-20 07:09:33 -06:00  Lucas Wilkinson
    [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)

774cd1d3bf  2024-08-12 16:29:20 -07:00  Daniele
    [CI/Build] bump minimum cmake version (#6999)

360bd67cf0  2024-08-05 17:54:23 -06:00  Isotr0py
    [Core] Support loading GGUF model (#5191)
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

4cf1dc39be  2024-08-05 14:22:57 -07:00  Tyler Michael Smith
    [Bugfix][CI/Build] Fix CUTLASS FetchContent (#7171)

8571ac4672  2024-08-05 15:13:43 -04:00  Tyler Michael Smith
    [Kernel] Update CUTLASS to 3.5.1 (#7085)

a8d604ca2a  2024-08-02 13:51:58 -07:00  Lucas Wilkinson
    [Misc] Disambiguate quantized types via a new ScalarType (#6396)

b482b9a5b1  2024-08-02 13:51:22 -07:00  Michael Goin
    [CI/Build] Add support for Python 3.12 (#7035)