20231088/vllm - vllm - Luminance Code Repo

20231088/vllm

Author	SHA1	Message	Date
Jeff Daily	a1c8f3796c	dynamic distpatch of fp8 kernels (#14245 ) Signed-off-by: Jeff Daily <jeff.daily@amd.com>	2025-03-11 10:54:56 -04:00
Szymon Ożóg	89cdaa83e7	[Kernel] Add more dtype support for GGUF kernels (#14043 ) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>	2025-03-10 07:30:04 -07:00
Jinzhen Lin	d0feea31c7	[Kernel] optimize performance of gptq marlin kernel when n is small (#14138 ) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-03-07 11:53:38 -05:00
Thomas Parnell	6bd1dd9d26	[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152 )	2025-03-06 07:39:16 -08:00
Nicolò Lucchesi	5ee10e990d	[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301 )	2025-03-05 20:00:53 -08:00
Michael Goin	dae9ec464c	Temporarily disable test_awq_gemm_opcheck (#14251 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-03-05 06:10:35 +00:00
Tyler Michael Smith	72c62eae5f	[V1] EP/TP MoE + DP Attention (#13931 )	2025-03-04 21:27:26 -08:00
TJian	848a6438ae	[ROCm] Faster Custom Paged Attention kernels (#12348 )	2025-03-03 09:24:45 -08:00
Harry Mellor	cf069aa8aa	Update deprecated Python 3.8 typing (#13971 )	2025-03-02 17:34:51 -08:00
YajieWang	6a92ff93e1	[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931 )	2025-02-28 22:30:59 -08:00
Michael Goin	788f284b53	Fix test_block_fp8.py test for MoE (#13915 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2025-02-27 18:00:00 +08:00
Lucas Wilkinson	f95903909f	[Kernel] FlashMLA integration (#13747 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-02-27 10:35:08 +08:00
Gregory Shtrasberg	aabeb2688f	[ROCm][Quantization][Kernel] Using HIP FP8 header (#12593 )	2025-02-25 00:39:59 -08:00
Harry Mellor	cdc1fa12eb	Remove unused kwargs from model definitions (#13555 )	2025-02-24 17:13:52 -08:00
Jongseok Park	781096e385	Expert Parallelism (EP) Support for DeepSeek V2 (#12583 )	2025-02-24 07:33:20 -08:00
Sage Moore	558db8083c	[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095 )	2025-02-22 05:25:41 -08:00
Kaixi Hou	e109e598c7	[NVIDIA] Support nvfp4 cutlass gemm (#13571 )	2025-02-22 05:24:05 -08:00
Lucas Wilkinson	288cc6c234	[Attention] MLA with chunked prefill (#12639 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Patrick Horn <patrick.horn@gmail.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-21 15:30:12 -08:00
Liangfu Chen	3809458456	[Bugfix] Fix invalid rotary embedding unit test (#13431 ) Signed-off-by: Liangfu Chen <liangfc@amazon.com>	2025-02-18 11:52:03 +00:00
youkaichao	124776ebd5	[ci] skip failed tests for flashinfer (#13352 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-16 22:09:15 +08:00
wchen61	dc0f7ccf8b	[BugFix] Enhance test_pos_encoding to support execution on multi-devices (#13187 ) Signed-off-by: wchen61 <wchen61@foxmail.com>	2025-02-16 08:59:49 +00:00
Tyler Michael Smith	c1e37bf71b	[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-14 00:01:14 +00:00
Kaixi Hou	4fc5c23bb6	[NVIDIA] Support nvfp4 quantization (#12784 )	2025-02-12 19:51:51 -08:00
Yu Chin Fabian Lim	aff404571b	Add Bamba Model (#10909 ) Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-06 15:22:42 -08:00
Isotr0py	85ac82d228	[Kernel] Make rotary_embedding ops more flexible with input shape (#12777 )	2025-02-06 08:46:13 -08:00
Lucas Wilkinson	75e94309e8	[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676 ) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-02-04 18:22:24 -08:00
Hongxia Yang	c36ac98d01	[AMD][ROCm] Enable DeepSeek model on ROCm (#12662 ) Signed-off-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>	2025-02-04 08:24:11 +00:00
Russell Bryant	e489ad7a21	[Misc] Add SPDX-License-Identifier headers to python source files (#12628 ) - Add SPDX license headers to python source files - Check for SPDX headers using pre-commit commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745 Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:18:24 2025 -0500 Add SPDX license headers to python source files This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance. The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job. More information can be found on the SPDX site: - https://spdx.dev/learn/handling-license-info/ Signed-off-by: Russell Bryant <rbryant@redhat.com> commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:36:32 2025 -0500 Check for SPDX headers using pre-commit Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-02 11:58:18 -08:00
Lucas Wilkinson	cabaf4eff3	[Attention] MLA decode optimizations (#12528 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-30 23:49:37 -08:00
Lucas Wilkinson	9798b2fb00	[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868 )	2025-01-30 18:33:00 -08:00
Jinzhen Lin	27b78c73ca	[Kernel] add triton fused moe kernel for gptq/awq (#12185 )	2025-01-29 09:07:09 -05:00
Harry Mellor	823ab79633	Update `pre-commit` hooks (#12475 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-27 17:23:08 -07:00
Bowen Wang	2bc3fbba0c	[FlashInfer] Upgrade to 0.2.0 (#11194 ) Signed-off-by: Bowen Wang <abmfy@icloud.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2025-01-27 18:19:24 +00:00
Tyler Michael Smith	72f4880425	[Bugfix/CI] Fix broken kernels/test_mha.py (#12450 )	2025-01-26 10:39:03 -08:00
Tyler Michael Smith	aa2cd2c43d	[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2025-01-26 19:59:58 +08:00
Isotr0py	f1fc0510df	[Misc] Add FA2 support to ViT MHA layer (#12355 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-25 15:07:35 +08:00
Lucas Wilkinson	ab5bbf5ae3	[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-01-24 15:27:59 +00:00
Gregory Shtrasberg	e97f802b2d	[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com>	2025-01-23 18:04:03 +00:00
Lucas Wilkinson	978b45f399	[Kernel] Flash Attention 3 Support (#12093 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2025-01-23 06:45:48 -08:00
rasmith	68c4421b6d	[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2025-01-23 00:10:37 +00:00
Isotr0py	dd7c9ad870	[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12067 ) Signed-off-by: Isotr0py <2037008807@qq.com>	2025-01-16 10:11:54 +00:00
wangxiyuan	3adf0ffda8	[Platform] Do not raise error if _Backend is not found (#12023 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-01-15 10:14:15 +00:00
Jee Jee Li	42f5e7c52a	[Kernel] Support MulAndSilu (#11624 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-01-15 02:29:53 +00:00
Avshalom Manevich	263a870ee1	[Hardware][TPU] workaround fix for MoE on TPU (#11764 )	2025-01-12 10:53:51 -05:00
Chen Zhang	cf5f000d21	[torch.compile] Hide KV cache behind torch.compile boundary (#11677 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-10 13:14:42 +08:00
wangxiyuan	405eb8e396	[platform] Allow platform specify attention backend (#11609 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>	2025-01-09 21:46:50 +08:00
Chen Zhang	e20c92bb61	[Kernel] Move attn_type to Attention.__init__() (#11690 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-07 00:11:28 +08:00
Woosuk Kwon	73001445fb	[V1] Implement Cascade Attention (#11635 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-01-01 21:56:46 +09:00
youkaichao	b12e87f942	[platforms] enable platform plugins (#11602 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2024-12-30 20:24:45 +08:00
Michael Goin	2072924d14	[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523 ) Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: HandH1998 <1335248067@qq.com>	2024-12-26 15:33:30 -08:00

1 2 3 4 5 ...

252 Commits