Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
Michael Goin
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
dongmao zhang
87525fab92
[bitsandbytes]: support reading bnb pre-quantized models ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Michael Goin
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidental cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
Michael Goin
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition ( #5577 )
2024-06-16 14:07:34 +00:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
Michael Goin
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, especially when the hidden size is large.
In detail, we applied three optimizations:
- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop by a factor of 4 to improve ILP.
- Use vectorized (4-wide) accesses to transfer data between HBM and SRAM.
A minimal sketch of the inverted-scale idea follows this entry.
2024-06-12 14:07:26 -07:00
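A minimal PyTorch sketch of the inverted-scale idea from the entry above. The real kernel is CUDA and additionally applies loop unrolling and vectorized memory accesses; the function name and FP8_MAX constant here are illustrative assumptions, not vLLM's API.

```python
import torch

FP8_MAX = 448.0  # largest finite value of float8_e4m3fn

def fp8_quantize_per_tensor(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize x to FP8 using a precomputed per-tensor scale.

    Rather than dividing every element by `scale`, compute the reciprocal
    once so the element-wise work is multiplication only ("inverted scale").
    """
    inv_scale = 1.0 / scale                        # one division in total
    q = (x * inv_scale).clamp(-FP8_MAX, FP8_MAX)   # multiplications only
    return q.to(torch.float8_e4m3fn)

x = torch.randn(4, 4096)
scale = x.abs().max() / FP8_MAX
x_fp8 = fp8_quantize_per_tensor(x, scale)
print(x_fp8.dtype, x_fp8.shape)
```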
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported
to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
youkaichao
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] Improve robustness of tests by replacing del with a context manager (vllm_runner) ( #5357 ). A sketch of the pattern follows this entry.
2024-06-08 08:59:20 +00:00
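A small sketch of the cleanup pattern referenced above, using a hypothetical stand-in for the real vllm_runner fixture defined in vLLM's tests/conftest.py (class and method names here are assumptions for illustration):

```python
from contextlib import contextmanager

class _FakeRunner:
    """Hypothetical stand-in for the model runner created in tests."""

    def __init__(self, model: str):
        self.model = model

    def generate_greedy(self, prompts, max_tokens):
        return [f"{self.model}: {p[:max_tokens]}" for p in prompts]

    def close(self):
        # In the real tests this is where GPU memory gets released.
        pass

@contextmanager
def vllm_runner(model: str):
    runner = _FakeRunner(model)
    try:
        yield runner
    finally:
        runner.close()  # teardown runs even if the test body raises

# Before the change, cleanup relied on `del runner`, which is not guaranteed
# to run promptly (or at all, if the test raises). The context manager makes
# teardown explicit and deterministic.
with vllm_runner("facebook/opt-125m") as runner:
    print(runner.generate_greedy(["Hello, world"], max_tokens=5))
```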
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API, so I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)
I have also moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relative to the test packages.
2024-05-13 23:50:09 +09:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
This PR provides initial support for FP8 computation, inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
The feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load the model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes them accordingly; the scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass. (A rough sketch of this scheme follows this entry.)
Initial results:
Tested Mistral-7B on 1xH100 with prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
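A rough Python sketch of the dynamic per-tensor scheme described above. The class name Fp8LinearSketch is hypothetical (it is not vLLM's Fp8LinearMethod), FP8_MAX assumes float8_e4m3fn, and the FP8 GEMM is emulated by dequantizing rather than using a fused scaled matmul.

```python
import torch

FP8_MAX = 448.0  # largest finite value of float8_e4m3fn

class Fp8LinearSketch(torch.nn.Module):
    """Illustrative dynamic per-tensor FP8 linear layer (hypothetical)."""

    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        # Weight scale is computed once, right after the FP16/BF16 load,
        # and stored for future use.
        self.weight_scale = weight_fp16.abs().max().float() / FP8_MAX
        self.weight_fp8 = (weight_fp16.float() / self.weight_scale).clamp(
            -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation scale is recomputed on every forward pass.
        act_scale = x.abs().max().float() / FP8_MAX
        x_fp8 = (x.float() / act_scale).clamp(-FP8_MAX, FP8_MAX).to(
            torch.float8_e4m3fn)
        # Emulate the FP8 GEMM by dequantizing; real kernels keep the
        # operands in FP8 and fold both scales into the matmul epilogue.
        y = x_fp8.float() @ self.weight_fp8.float().t()
        return (y * act_scale * self.weight_scale).to(x.dtype)

layer = Fp8LinearSketch(torch.randn(1024, 512, dtype=torch.float16))
out = layer(torch.randn(2, 512, dtype=torch.float16))
print(out.shape)  # torch.Size([2, 1024])
```

As the entry notes, the feature itself is toggled at engine launch with --quantization fp8 (or -q fp8).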
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00