20231088/vllm - vllm - Luminance Code Repo

Author	SHA1	Message	Date
Aaron Pham	21063c11c7	[CI/Build] drop support for Python 3.8 EOL (#8464) Signed-off-by: Aaron Pham <contact@aarnphm.xyz>	2024-11-06 07:11:55 +00:00
bnellnm	d93478b399	[Bugfix] Upgrade to pytorch 2.5.1 (#10001) Signed-off-by: Bill Nell <bill@neuralmagic.com>	2024-11-04 15:11:28 -08:00
bnellnm	3cb07a36a2	[Misc] Upgrade to pytorch 2.5 (#9588) Signed-off-by: Bill Nell <bill@neuralmagic.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2024-10-27 09:44:24 +00:00
Charlie Fu	59449095ab	[Performance][Kernel] Fused_moe Performance Improvement (#9384) Signed-off-by: charlifu <charlifu@amd.com>	2024-10-24 15:37:52 -07:00
Luka Govedič	51c24c9736	[Build] Fix `FetchContent` multiple build issue (#9596) Signed-off-by: luka <luka@neuralmagic.com>	2024-10-23 12:43:07 +08:00
Lucas Wilkinson	d1e8240875	[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing (#9487)	2024-10-22 15:41:13 -07:00
bnellnm	eca2c5f7c0	[Bugfix] Fix support for dimension like integers and ScalarType (#9299)	2024-10-17 19:08:34 +00:00
Lucas Wilkinson	717a5f82cd	[Bugfix][CI/Build] Fix CUDA 11.8 Build (#9386)	2024-10-16 00:15:21 +00:00
Lucas Wilkinson	de9fb4bef8	[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected (#9254)	2024-10-11 15:57:39 -04:00
ElizaWszola	05d686432f	[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973) Co-authored-by: Dipika <dipikasikka1@gmail.com> Co-authored-by: Dipika Sikka <ds3822@columbia.edu>	2024-10-04 12:34:44 -06:00
Lucas Wilkinson	22482e495e	[Bugfix] Flash attention arches not getting set properly (#9062)	2024-10-04 09:43:15 -06:00
Lucas Wilkinson	aeb37c2a72	[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845)	2024-10-03 22:55:25 -04:00
Tyler Michael Smith	2e7fe7e79f	[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (#8930)	2024-09-29 03:13:01 +00:00
ElizaWszola	a928ded995	[Kernel] Split Marlin MoE kernels into multiple files (#8661) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-09-24 09:31:42 -07:00
Lucas Wilkinson	86e9c8df29	[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701) Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-09-23 13:46:26 -04:00
Tyler Michael Smith	3dda7c2250	[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (#8702)	2024-09-22 22:24:59 -04:00
youkaichao	0e40ac9b7b	[ci][build] fix vllm-flash-attn (#8699)	2024-09-21 23:24:58 -07:00
Luka Govedič	71c60491f2	[Kernel] Build flash-attn from source (#8245)	2024-09-20 23:27:10 -07:00
Charlie Fu	1ef0d2efd0	[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)	2024-09-13 17:01:11 -07:00
Tyler Michael Smith	94144e726c	[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043)	2024-09-10 23:51:58 +00:00
Dipika Sikka	23f322297f	[Misc] Remove `SqueezeLLM` (#8220)	2024-09-06 16:29:03 -06:00
Mor Zusman	fdd9daafa3	[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)	2024-08-28 15:06:52 -07:00
Dipika Sikka	fc911880cc	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-27 15:07:09 -07:00
Lucas Wilkinson	55d63b1211	[Bugfix] Don't build machete on cuda <12.0 (#7757)	2024-08-22 08:28:52 -04:00
Michael Goin	aae74ef95c	Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764)	2024-08-22 03:42:14 +00:00
Dipika Sikka	8678a69ab5	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-21 16:17:10 -07:00
sasha0552	1b32e02648	[Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730)	2024-08-21 11:17:48 -07:00
Lucas Wilkinson	5288c06aa0	[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)	2024-08-20 07:09:33 -06:00
Daniele	774cd1d3bf	[CI/Build] bump minimum cmake version (#6999)	2024-08-12 16:29:20 -07:00
Isotr0py	360bd67cf0	[Core] Support loading GGUF model (#5191) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-05 17:54:23 -06:00
Tyler Michael Smith	4cf1dc39be	[Bugfix][CI/Build] Fix CUTLASS FetchContent (#7171)	2024-08-05 14:22:57 -07:00
Tyler Michael Smith	8571ac4672	[Kernel] Update CUTLASS to 3.5.1 (#7085)	2024-08-05 15:13:43 -04:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396)	2024-08-02 13:51:58 -07:00
Michael Goin	b482b9a5b1	[CI/Build] Add support for Python 3.12 (#7035)	2024-08-02 13:51:22 -07:00
Tyler Michael Smith	6a11fdfbb8	[CI/Build][Bugfix] Fix CUTLASS header-only line (#7034)	2024-08-01 13:51:15 -07:00
Sage Moore	7e0861bd0b	[CI/Build] Update PyTorch to 2.4.0 (#6951) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-01 11:11:24 -07:00
Jee Jee Li	7ecee34321	[Kernel][RFC] Refactor the punica kernel based on Triton (#5036)	2024-07-31 17:12:24 -07:00
HandH1998	6512937de1	Support W4A8 quantization for vllm (#5218)	2024-07-31 07:55:21 -06:00
Alexander Matveev	396d92d5e0	[Kernel][Core] Add AWQ support to the Marlin kernel (#6612)	2024-07-21 19:41:42 -04:00
Matt Wong	06d6c5fe9f	[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543)	2024-07-20 09:39:07 -07:00
Alexander Matveev	e76466dde2	[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338)	2024-07-17 14:30:28 -07:00
Cody Yu	aa48e502fb	[MISC] Upgrade dependency to PyTorch 2.3.1 (#5327)	2024-07-12 12:04:26 -07:00
Michael Goin	47f0954af0	[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975)	2024-07-03 17:38:00 +00:00
Luka Govedič	5bfd1bbc98	[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-06-26 15:16:00 +00:00
Matt Wong	dd793d1de5	[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422)	2024-06-25 15:56:15 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)	2024-06-09 16:23:30 -04:00
Divakar Verma	a66cf40b20	[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) This PR enables the fused topk_softmax kernel used in moe layer for HIP	2024-06-02 14:13:26 -07:00
simon-mo	533c217792	Fix cutlass sm_90a vesrion in CMakeList	2024-05-31 02:13:01 +00:00
Simon Mo	45a1a69b98	[Build] Disable sm_90a in cu11 (#5141)	2024-05-30 14:37:16 -07:00

67 Commits