20231088/vllm - vllm - Luminance Code Repo

Author	SHA1	Message	Date
Dipika Sikka	60508ffda9	[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995 ) Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com> Co-authored-by: ilmarkov <markovilya197@gmail.com> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2024-12-18 09:57:16 -05:00
Luka Govedič	30870b4f66	[torch.compile] Dynamic fp8 + rms_norm fusion (#10906 ) Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-12-13 03:19:23 +00:00
kliuae	7c25fe45a6	[AMD] Add support for GGUF quantization on ROCm (#10254 )	2024-11-22 21:14:49 -08:00
Luka Govedič	4f93dfe952	[torch.compile] Fuse RMSNorm with quant (#9138 ) Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@126.com>	2024-11-08 21:20:08 +00:00
Hanzhi Zhou	6192e9b8fe	[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030 ) Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com>	2024-11-06 23:50:47 -08:00
youkaichao	8549c82660	[core] cudagraph output with tensor weak reference (#9724 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-10-27 00:19:28 -07:00
Charlie Fu	59449095ab	[Performance][Kernel] Fused_moe Performance Improvement (#9384 ) Signed-off-by: charlifu <charlifu@amd.com>	2024-10-24 15:37:52 -07:00
Jee Jee Li	295a061fb3	[Kernel] add kernel for FATReLU (#9610 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2024-10-24 16:18:27 +08:00
Mor Zusman	fb60ae9b91	[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189 )	2024-10-16 12:12:43 -04:00
Lucas Wilkinson	aeb37c2a72	[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845 )	2024-10-03 22:55:25 -04:00
Mor Zusman	f13a07b1f8	[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533 )	2024-09-29 17:35:58 -04:00
Lucas Wilkinson	86e9c8df29	[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701 ) Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-09-23 13:46:26 -04:00
Tyler Michael Smith	8110e44529	[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012 )	2024-09-17 23:44:27 +00:00
youkaichao	99aa4eddaf	[torch.compile] register allreduce operations as custom ops (#8526 )	2024-09-16 22:57:57 -07:00
Luka Govedič	5d73ae49d6	[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270 )	2024-09-16 11:52:40 -07:00
William Lin	a6c0f3658d	[multi-step] add flashinfer backend (#7928 )	2024-09-12 11:16:22 -07:00
bnellnm	73202dbe77	[Kernel][Misc] register ops to prevent graph breaks (#6917 ) Co-authored-by: Sage Moore <sage@neuralmagic.com>	2024-09-11 12:52:19 -07:00
Dipika Sikka	23f322297f	[Misc] Remove `SqueezeLLM` (#8220 )	2024-09-06 16:29:03 -06:00
Mor Zusman	fdd9daafa3	[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651 )	2024-08-28 15:06:52 -07:00
Lucas Wilkinson	5288c06aa0	[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174 )	2024-08-20 07:09:33 -06:00
bnellnm	37fd47e780	[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596 )	2024-08-16 14:00:11 -07:00
Luka Govedič	8d59dbb000	[Kernel] Add per-tensor and per-token AZP epilogues (#5941 ) Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2024-08-06 18:17:08 +00:00
Isotr0py	360bd67cf0	[Core] Support loading GGUF model (#5191 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-05 17:54:23 -06:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396 )	2024-08-02 13:51:58 -07:00
HandH1998	6512937de1	Support W4A8 quantization for vllm (#5218 )	2024-07-31 07:55:21 -06:00
Alexander Matveev	75acdaa4b6	[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795 )	2024-07-27 17:52:33 -04:00
Alexander Matveev	396d92d5e0	[Kernel][Core] Add AWQ support to the Marlin kernel (#6612 )	2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath	2e26564259	[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593 ) Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>	2024-07-19 18:15:26 -07:00
Varun Sundar Rabindranath	b5241e41d9	[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-18 01:38:35 +00:00
Alexander Matveev	e76466dde2	[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338 )	2024-07-17 14:30:28 -07:00
Michael Goin	978aed5300	[Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081 )	2024-07-16 15:31:32 -07:00
Michael Goin	47f0954af0	[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975 )	2024-07-03 17:38:00 +00:00
Chip Kerchner	38a1674abb	Support CPU inference with VSX PowerPC ISA (#5652 )	2024-06-26 21:53:04 +00:00
Luka Govedič	5bfd1bbc98	[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560 ) Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-06-26 15:16:00 +00:00
Tyler Michael Smith	3f3b6b2150	[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715 )	2024-06-20 18:36:10 +00:00
Roger Wang	ad137cd111	[Model] Port over CLIPVisionModel for VLMs (#5591 )	2024-06-20 11:52:09 +00:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )	2024-06-09 16:23:30 -04:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00
Tyler Michael Smith	cbb2f59cc8	[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159 )	2024-06-03 09:52:30 -07:00
Eric Xihui Lin	8e192ff967	[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799 ) Co-authored-by: beagleski <yunanzhang@microsoft.com> Co-authored-by: bapatra <bapatra@microsoft.com> Co-authored-by: Barun Patra <codedecde@users.noreply.github.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-05-24 22:00:52 -07:00
Dipika Sikka	a1242324c9	[Kernel] Initial Activation Quantization Support (#4525 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-05-23 21:29:18 +00:00
Michael Goin	5f6d10c14c	[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722 )	2024-05-22 07:18:41 +00:00
Tyler Michael Smith	2060e93659	[Kernel] Add w8a8 CUTLASS kernels (#4749 )	2024-05-16 18:32:50 -04:00
Alexander Matveev	6979ade384	Add GPTQ Marlin 2:4 sparse structured support (#4790 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-05-16 12:56:15 -04:00
SangBin Cho	3521ba4f25	[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518 )	2024-05-03 10:20:12 -07:00
alexm-nm	7038e8b803	[Kernel] Support running GPTQ 8-bit models in Marlin (#4533 )	2024-05-02 12:56:22 -04:00
Robert Shaw	73c8d677e5	[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922 ) Co-authored-by: alexm <alexm@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-29 09:35:34 -07:00
Philipp Moritz	12628d3c78	[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-27 04:49:59 +00:00
Philipp Moritz	eace8bf0b9	[Kernel] FP8 support for MoE kernel / Mixtral (#4244)	2024-04-24 01:18:23 +00:00

This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Performance:

For this PR, the focus is on making the code clean (while still trying to get reasonable performance), there is a bunch of optimizations that we will submit as a follow up PR that significantly improve the performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">

Accuracy:

The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```

this compares favorably with the fp16 results which are

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```

Happy hacking!

67 Commits