20231088/vllm - vllm - Luminance Code Repo

20231088/vllm

Author	SHA1	Message	Date
DefTruth	0f9a6e3d22	[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573 )	2024-05-08 09:19:58 -07:00
SangBin Cho	3521ba4f25	[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518 )	2024-05-03 10:20:12 -07:00
Michał Moskal	32881f3f31	[kernel] fix sliding window in prefix prefill Triton kernel (#4405 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2024-05-02 11:23:37 -07:00
Michał Moskal	e8cc7967ff	[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128 )	2024-04-18 00:51:28 -07:00
Roger Wang	45b6ef6513	feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277 )	2024-03-27 13:39:26 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00
Woosuk Kwon	925f3332ca	[Core] Refactor Attention Take 2 (#3462 )	2024-03-25 04:39:33 +00:00
simon-mo	ad50bf4b25	fix lint	2024-03-15 22:23:38 -07:00
Tao He	3123f15138	Fixes the incorrect argument in the prefix-prefill test cases (#3246 )	2024-03-15 20:58:10 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305 )	2024-03-10 19:49:14 -07:00
Woosuk Kwon	2daf23ab0c	Separate attention backends (#3005 )	2024-03-07 01:45:50 -08:00
Tao He	71bcaf99e2	Enable GQA support in the prefix prefill kernels (#3007 ) Signed-off-by: Tao He <sighingnow@gmail.com>	2024-02-27 01:14:31 -08:00
Kunshang Ji	96b6f475dd	Remove hardcoded `device="cuda"` to support more devices (#2503 ) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>	2024-02-01 15:46:39 -08:00
Jason Zhu	7a0b011dd5	Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553 )	2024-01-22 14:47:25 -08:00
shiyi.c_98	d10f8e1d43	[Experimental] Prefix Caching Support (#1669 ) Co-authored-by: DouHappy <2278958187@qq.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-01-17 16:32:10 -08:00

15 Commits