Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
Michael Goin
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
dongmao zhang
87525fab92
[bitsandbytes]: support reading bnb pre-quantized models ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Michael Goin
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidental cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
Michael Goin
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition ( #5577 )
2024-06-16 14:07:34 +00:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
Michael Goin
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, especially when the hidden size is large.
In detail, we applied three optimizations:
- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop by a factor of 4 to improve ILP.
- Use vectorized (4-wide) accesses to transfer data between HBM and SRAM.
A minimal sketch of the inverted-scale idea follows this entry.
2024-06-12 14:07:26 -07:00
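A minimal PyTorch sketch of the inverted-scale idea from the entry above. The real kernel is CUDA and additionally applies loop unrolling and vectorized memory accesses; the function name and FP8_MAX constant here are illustrative assumptions, not vLLM's API.

```python
import torch

FP8_MAX = 448.0  # largest finite value of float8_e4m3fn

def fp8_quantize_per_tensor(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize x to FP8 using a precomputed per-tensor scale.

    Rather than dividing every element by `scale`, compute the reciprocal
    once so the element-wise work is multiplication only ("inverted scale").
    """
    inv_scale = 1.0 / scale                        # one division in total
    q = (x * inv_scale).clamp(-FP8_MAX, FP8_MAX)   # multiplications only
    return q.to(torch.float8_e4m3fn)

x = torch.randn(4, 4096)
scale = x.abs().max() / FP8_MAX
x_fp8 = fp8_quantize_per_tensor(x, scale)
print(x_fp8.dtype, x_fp8.shape)
```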
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported
to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
youkaichao
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] Improve robustness of tests by replacing del with a context manager (vllm_runner) ( #5357 ). A sketch of the pattern follows this entry.
2024-06-08 08:59:20 +00:00
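A small sketch of the cleanup pattern referenced above, using a hypothetical stand-in for the real vllm_runner fixture defined in vLLM's tests/conftest.py (class and method names here are assumptions for illustration):

```python
from contextlib import contextmanager

class _FakeRunner:
    """Hypothetical stand-in for the model runner created in tests."""

    def __init__(self, model: str):
        self.model = model

    def generate_greedy(self, prompts, max_tokens):
        return [f"{self.model}: {p[:max_tokens]}" for p in prompts]

    def close(self):
        # In the real tests this is where GPU memory gets released.
        pass

@contextmanager
def vllm_runner(model: str):
    runner = _FakeRunner(model)
    try:
        yield runner
    finally:
        runner.close()  # teardown runs even if the test body raises

# Before the change, cleanup relied on `del runner`, which is not guaranteed
# to run promptly (or at all, if the test raises). The context manager makes
# teardown explicit and deterministic.
with vllm_runner("facebook/opt-125m") as runner:
    print(runner.generate_greedy(["Hello, world"], max_tokens=5))
```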
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API, so I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)
I have also moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relative to the test packages.
2024-05-13 23:50:09 +09:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
This PR provides initial support for FP8 computation, inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
The feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly; the scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass. (A rough sketch of this scheme follows this entry.)
Initial results:
Tested Mistral-7B on 1xH100 with prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
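A rough Python sketch of the dynamic per-tensor scheme described above. The class name Fp8LinearSketch is hypothetical (it is not vLLM's Fp8LinearMethod), FP8_MAX assumes float8_e4m3fn, and the FP8 GEMM is emulated by dequantizing rather than using a fused scaled matmul.

```python
import torch

FP8_MAX = 448.0  # largest finite value of float8_e4m3fn

class Fp8LinearSketch(torch.nn.Module):
    """Illustrative dynamic per-tensor FP8 linear layer (hypothetical)."""

    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        # Weight scale is computed once, right after the FP16/BF16 load,
        # and stored for future use.
        self.weight_scale = weight_fp16.abs().max().float() / FP8_MAX
        self.weight_fp8 = (weight_fp16.float() / self.weight_scale).clamp(
            -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation scale is recomputed on every forward pass.
        act_scale = x.abs().max().float() / FP8_MAX
        x_fp8 = (x.float() / act_scale).clamp(-FP8_MAX, FP8_MAX).to(
            torch.float8_e4m3fn)
        # Emulate the FP8 GEMM by dequantizing; real kernels keep the
        # operands in FP8 and fold both scales into the matmul epilogue.
        y = x_fp8.float() @ self.weight_fp8.float().t()
        return (y * act_scale * self.weight_scale).to(x.dtype)

layer = Fp8LinearSketch(torch.randn(1024, 512, dtype=torch.float16))
out = layer(torch.randn(2, 512, dtype=torch.float16))
print(out.shape)  # torch.Size([2, 1024])
```

As the entry notes, the feature itself is toggled at engine launch with --quantization fp8 (or -q fp8).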
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00