Author | Commit | Message | Date
Dipika Sikka | 8cef6e02dc | [Misc] add w8a8 asym models (#11075) | 2024-12-23 13:33:20 -05:00
Tyler Michael Smith | 5a9da2e6e9 | [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311) | 2024-12-19 02:43:30 +00:00
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Dipika Sikka | 60508ffda9 | [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995) | 2024-12-18 09:57:16 -05:00
    Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
    Co-authored-by: ilmarkov <markovilya197@gmail.com>
    Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
    Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
youkaichao | 388ee3de66 | [torch.compile] limit inductor threads and lazy import quant (#10482) | 2024-11-20 18:36:33 -08:00
    Signed-off-by: youkaichao <youkaichao@gmail.com>
Yan Ma | 6b2d25efc7 | [Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107) | 2024-11-18 11:18:05 -07:00
    Signed-off-by: yan ma <yan.ma@intel.com>
Luka Govedič | bf2ddc6610 | [bugfix] Fix static asymmetric quantization case (#10334) | 2024-11-15 09:35:11 +08:00
    Signed-off-by: Daniël de Kok <me@danieldk.eu>
    Signed-off-by: luka <luka@neuralmagic.com>
    Co-authored-by: Daniël de Kok <me@danieldk.eu>
HoangCongDuc | ac49b59d8b | [Bugfix] bitsandbytes models fail to run pipeline parallel (#10200) | 2024-11-13 09:56:39 -07:00
    Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com>
Joe Runde | 380e18639f | 🐛 fix torch memory profiling (#9516) | 2024-10-18 21:25:19 -04:00
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Cyrus Leung | 051eaf6db3 | [Model] Add user-configurable task for models that support both generation and embedding (#9424) | 2024-10-18 11:31:58 -07:00
Michael Goin | 22f8a69549 | [Misc] Directly use compressed-tensors for checkpoint definitions (#8909) | 2024-10-15 15:40:25 -07:00
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Li, Jiang | ca77dd7a44 | [Hardware][CPU] Support AWQ for CPU backend (#7515) | 2024-10-09 10:28:08 -06:00
chenqianfzh | 2f4117c38e | support bitsandbytes quantization with more models (#9148) | 2024-10-08 19:52:19 -06:00
Luka Govedič | 172d1cd276 | [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271) | 2024-09-27 14:25:10 -04:00
Jee Jee Li | 13f9f7a3d0 | [Misc] Upgrade bitsandbytes to the latest version 0.44.0 (#8768) | 2024-09-24 17:08:55 -07:00
Cyrus Leung | 6ffa3f314c | [CI/Build] Avoid CUDA initialization (#8534) | 2024-09-18 10:38:11 +00:00
chenqianfzh | 9855b99502 | [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) | 2024-09-17 08:09:12 -07:00
youkaichao | a2469127db | [misc][ci] fix quant test (#8449) | 2024-09-13 17:20:14 +08:00
Li, Jiang | 0b952af458 | [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) | 2024-09-11 09:46:46 -07:00
chenqianfzh | 4664ceaad6 | support bitsandbytes 8-bit and FP4 quantized models (#7445) | 2024-08-29 19:09:08 -04:00
Dipika Sikka | fc911880cc | [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) | 2024-08-27 15:07:09 -07:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
youkaichao | 9e51b6a626 | [ci][test] adjust max wait time for cpu offloading test (#7709) | 2024-08-20 17:12:44 -07:00
Mor Zusman | 7fc23be81c | [Kernel] W8A16 Int8 inside FusedMoE (#7415) | 2024-08-16 10:06:51 -07:00
jon-chuang | 50b8d08dbd | [Misc/Testing] Use torch.testing.assert_close (#7324) | 2024-08-16 04:24:04 +00:00
Michael Goin | e165528778 | [CI] Move quantization cpu offload tests out of fastcheck (#7574) | 2024-08-15 21:16:20 -07:00
Kyle Sayers | f55a9aea45 | [Misc] Revert compressed-tensors code reuse (#7521) | 2024-08-14 15:07:37 -07:00
Kyle Sayers | 373538f973 | [Misc] compressed-tensors code reuse (#7277) | 2024-08-13 19:05:15 -04:00
Michael Goin | 5223199e03 | [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) | 2024-08-07 11:23:12 -07:00
Dipika Sikka | 0f7052bc7e | [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874) | 2024-08-07 09:17:58 -07:00
Isotr0py | 360bd67cf0 | [Core] Support loading GGUF model (#5191) | 2024-08-05 17:54:23 -06:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
Michael Goin | fb3db61688 | [CI/Build] Remove sparseml requirement from testing (#7037) | 2024-08-01 12:00:51 -07:00
Tyler Michael Smith | d7a299edaa | [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) | 2024-07-30 16:37:01 -04:00
Michael Goin | 65b1f121c8 | [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) | 2024-07-25 09:46:15 -07:00
dongmao zhang | 87525fab92 | [bitsandbytes]: support read bnb pre-quantized model (#5753) | 2024-07-23 23:45:09 +00:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
Michael Goin | 01c16ede6b | [CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) | 2024-07-23 22:45:12 +00:00
Michael Goin | 9e0b558a09 | [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) | 2024-07-23 04:11:50 +00:00
Alexander Matveev | 396d92d5e0 | [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) | 2024-07-21 19:41:42 -04:00
Michael Goin | 978aed5300 | [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) | 2024-07-16 15:31:32 -07:00
Robert Shaw | b675069d74 | [ Misc ] Refactor Marlin Python Utilities (#6082) | 2024-07-11 15:40:11 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Robert Shaw | abfe705a02 | [ Misc ] Support Fp8 via llm-compressor (#6110) | 2024-07-07 20:42:11 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Robert Shaw | 62963d129e | [ Misc ] Clean Up CompressedTensorsW8A8 (#6113) | 2024-07-03 22:50:08 +00:00
Michael Goin | 47f0954af0 | [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) | 2024-07-03 17:38:00 +00:00
youkaichao | 482045ee77 | [hardware][misc] introduce platform abstraction (#6080) | 2024-07-02 20:12:22 -07:00
Qubitium-ModelCloud | ee93f4f92a | [CORE] Quantized lm-head Framework (#4442) | 2024-07-02 22:25:17 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
    Co-authored-by: ZX <zx@lbx.dev>
youkaichao | 614aa51203 | [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) | 2024-06-30 20:07:34 -07:00
Robert Shaw | af9ad46fca | [ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (#5940) | 2024-06-30 23:06:27 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Dipika Sikka | dd248f7675 | [Misc] Update w4a16 compressed-tensors support to include w8a16 (#5794) | 2024-06-25 19:23:35 +00:00
Dipika Sikka | 4a30d7e3cc | [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) | 2024-06-19 18:06:44 -04:00
Dipika Sikka | 95db455e7f | [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) | 2024-06-18 12:45:05 -04:00
Dipika Sikka | 890d8d960b | [Kernel] compressed-tensors marlin 24 support (#5435) | 2024-06-17 12:32:48 -04:00
Michael Goin | 4a6769053a | [CI][BugFix] Flip is_quant_method_supported condition (#5577) | 2024-06-16 14:07:34 +00:00