67 Commits

Author SHA1 Message Date
Dipika Sikka
8cef6e02dc
[Misc] add w8a8 asym models (#11075) 2024-12-23 13:33:20 -05:00
Tyler Michael Smith
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
Dipika Sikka
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
youkaichao
388ee3de66
[torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00
Yan Ma
6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-11-18 11:18:05 -07:00
Luka Govedič
bf2ddc6610
[bugfix] Fix static asymmetric quantization case (#10334)
Signed-off-by: Daniël de Kok <me@danieldk.eu>
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-11-15 09:35:11 +08:00
HoangCongDuc
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel (#10200)
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com>
2024-11-13 09:56:39 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling (#9516)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Michael Goin
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions (#8909)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-15 15:40:25 -07:00
Li, Jiang
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend (#7515) 2024-10-09 10:28:08 -06:00
chenqianfzh
2f4117c38e
support bitsandbytes quantization with more models (#9148) 2024-10-08 19:52:19 -06:00
Luka Govedič
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271) 2024-09-27 14:25:10 -04:00
Jee Jee Li
13f9f7a3d0
[[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (#8768) 2024-09-24 17:08:55 -07:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) 2024-09-17 08:09:12 -07:00
youkaichao
a2469127db
[misc][ci] fix quant test (#8449) 2024-09-13 17:20:14 +08:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) 2024-09-11 09:46:46 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models (#7445) 2024-08-29 19:09:08 -04:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test (#7709) 2024-08-20 17:12:44 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close (#7324) 2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck (#7574) 2024-08-15 21:16:20 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse (#7521) 2024-08-14 15:07:37 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse (#7277) 2024-08-13 19:05:15 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) 2024-08-07 11:23:12 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874) 2024-08-07 09:17:58 -07:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing (#7037) 2024-08-01 12:00:51 -07:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) 2024-07-30 16:37:01 -04:00
Michael Goin
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) 2024-07-25 09:46:15 -07:00
dongmao zhang
87525fab92
[bitsandbytes]: support read bnb pre-quantized model (#5753)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Michael Goin
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) 2024-07-23 22:45:12 +00:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors (#6528) 2024-07-23 04:11:50 +00:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities (#6082)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor (#6110)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 (#6113) 2024-07-03 22:50:08 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) 2024-07-03 17:38:00 +00:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction (#6080) 2024-07-02 20:12:22 -07:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) 2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (#5940)
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 (#5794) 2024-06-25 19:23:35 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) 2024-06-19 18:06:44 -04:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) 2024-06-18 12:45:05 -04:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support (#5435) 2024-06-17 12:32:48 -04:00
Michael Goin
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition (#5577) 2024-06-16 14:07:34 +00:00