20231088/vllm - vllm - Luminance Code Repo

20231088/vllm

Author	SHA1	Message	Date
Luka Govedič	bf2ddc6610	[bugfix] Fix static asymmetric quantization case (#10334 ) Signed-off-by: Daniël de Kok <me@danieldk.eu> Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-11-15 09:35:11 +08:00
Michael Goin	22f8a69549	[Misc] Directly use compressed-tensors for checkpoint definitions (#8909 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-10-15 15:40:25 -07:00
Luka Govedič	172d1cd276	[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271 )	2024-09-27 14:25:10 -04:00
Li, Jiang	0b952af458	[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257 )	2024-09-11 09:46:46 -07:00
Dipika Sikka	fc911880cc	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-27 15:07:09 -07:00
Kyle Sayers	f55a9aea45	[Misc] Revert `compressed-tensors` code reuse (#7521 )	2024-08-14 15:07:37 -07:00
Kyle Sayers	373538f973	[Misc] `compressed-tensors` code reuse (#7277 )	2024-08-13 19:05:15 -04:00
Dipika Sikka	0f7052bc7e	[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` (#5874 )	2024-08-07 09:17:58 -07:00
Michael Goin	fb3db61688	[CI/Build] Remove sparseml requirement from testing (#7037 )	2024-08-01 12:00:51 -07:00
Michael Goin	9e0b558a09	[Misc] Support FP8 kv cache scales from compressed-tensors (#6528 )	2024-07-23 04:11:50 +00:00
Robert Shaw	b675069d74	[ Misc ] Refactor Marlin Python Utilities (#6082 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-07-11 15:40:11 +00:00
Robert Shaw	abfe705a02	[ Misc ] Support Fp8 via `llm-compressor` (#6110 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-07-07 20:42:11 +00:00
Robert Shaw	62963d129e	[ Misc ] Clean Up `CompressedTensorsW8A8` (#6113 )	2024-07-03 22:50:08 +00:00
Robert Shaw	af9ad46fca	[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-30 23:06:27 +00:00
Dipika Sikka	dd248f7675	[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (#5794 )	2024-06-25 19:23:35 +00:00
Dipika Sikka	4a30d7e3cc	[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650 )	2024-06-19 18:06:44 -04:00
Dipika Sikka	95db455e7f	[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542 )	2024-06-18 12:45:05 -04:00
Dipika Sikka	890d8d960b	[Kernel] `compressed-tensors` marlin 24 support (#5435 )	2024-06-17 12:32:48 -04:00
Dipika Sikka	c2637a613b	[Kernel] `w4a16` support for `compressed-tensors` (#5385 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 10:19:56 -04:00
Dipika Sikka	5884c2b454	[Misc] Update to comply with the new `compressed-tensors` config (#5350 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-06-10 03:49:46 +00:00
youkaichao	8ea5e44a43	[CI/Test] improve robustness of test (vllm_runner) (#5357 ) [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)	2024-06-08 08:59:20 +00:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00
Dipika Sikka	a1242324c9	[Kernel] Initial Activation Quantization Support (#4525 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-05-23 21:29:18 +00:00

23 Commits