omer-dayan
995f56236b
[Core] Loading model from S3 using RunAI Model Streamer as an optional loader ( #10192 )
Signed-off-by: OmerD <omer@run.ai>
2024-12-20 16:46:24 +00:00
Aaron Pham
21063c11c7
[CI/Build] drop support for Python 3.8 (EOL) ( #8464 )
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
Cyrus Leung
390be74649
[Misc] Print stack trace using logger.exception ( #9461 )
2024-10-17 13:55:48 +00:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
Michael Goin
21313e09e3
[Bugfix] Fix default weight loading for scalars ( #7534 )
2024-08-15 13:10:22 -07:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
Isotr0py
8334c39f37
[Bugfix] Fix new Llama3.1 GGUF model loading ( #7269 )
2024-08-08 13:42:44 -07:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Woosuk Kwon
23993a7997
[Bugfix][TPU] Do not use torch.Generator for TPUs ( #6981 )
2024-07-31 18:50:28 -07:00
dongmao zhang
87525fab92
[bitsandbytes]: support reading bnb pre-quantized models ( #5753 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Simon Mo
3eda4ec780
support ignore patterns in model loader ( #6673 )
2024-07-22 23:59:42 -07:00
youkaichao
c5201240a4
[misc] show tqdm only for the first rank ( #6672 )
2024-07-22 21:57:27 -07:00
zhaotyer
e519ae097a
add tqdm when loading checkpoint shards ( #6569 )
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-22 20:48:01 -07:00
youkaichao
ce37be7ba0
[misc][distributed] add seed to dummy weights ( #6491 )
2024-07-16 19:16:34 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Robert Shaw
73030b7dae
[Misc] Enable Quantizing All Layers of DeepSeek-V2 ( #6423 )
2024-07-14 21:38:42 +00:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
Robert Shaw
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-24 12:28:27 +00:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Mor Zusman
f0eecee610
[Bugfix] Fix dummy weight for fp8 ( #4916 )
Allow dummy load format for fp8; torch.uniform_ doesn't support FP8 at the moment, so dummy FP8 weights need a workaround (a sketch of one approach follows this entry).
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-05-20 18:44:25 +00:00
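A minimal sketch of one such workaround, assuming a recent PyTorch build that provides torch.float8_e4m3fn: initialize the random dummy weights in FP16 and cast down afterwards. The helper name is hypothetical, not vLLM's actual code.

```python
import torch

def dummy_fp8_weight(shape, low=-1e-3, high=1e-3):
    # Hypothetical helper: Tensor.uniform_ has no FP8 kernel, so fill the
    # dummy weights in FP16 first, then cast the result down to FP8.
    w = torch.empty(shape, dtype=torch.float16)
    w.uniform_(low, high)
    return w.to(torch.float8_e4m3fn)
```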
Prashant Gupta
d6e520e170
[Core] Support offline use of local cache for models ( #4374 )
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
SangBin Cho
a88081bf76
[CI] Disable non-lazy string operation on logging ( #4326 )
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com>
2024-04-26 00:16:58 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated on every forward pass (see the sketch after this entry).
Initial Results:
Currently tested with Mistral-7B on 1xH100, prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
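A minimal sketch of the dynamic per-tensor scaling described in the entry above, assuming PyTorch's torch.float8_e4m3fn dtype; the function name is illustrative and not the actual Fp8LinearMethod API.

```python
import torch

def quantize_per_tensor_fp8(x: torch.Tensor):
    # Map the largest magnitude in the tensor onto the FP8 (e4m3) range and
    # keep the resulting scale so the matmul can dequantize later.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3
    scale = x.abs().max().float().clamp(min=1e-12) / fp8_max
    x_fp8 = (x.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Weights: quantized once after checkpoint load, with the scale stored for reuse.
# Activations: the same routine runs on every forward pass ("dynamic" scaling).
```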
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00