20231088/vllm - vllm - Luminance Code Repo

20231088/vllm

Author	SHA1	Message	Date
Robert Shaw	c0c2335ce0	Integrate Marlin Kernels for Int4 GPTQ inference (#2497 ) Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com> Co-authored-by: alexm <alexm@neuralmagic.com>	2024-03-01 12:47:51 -08:00
felixzhu555	703e42ee4b	Add guided decoding for OpenAI API server (#2819 ) Co-authored-by: br3no <breno@veltefaria.de> Co-authored-by: simon-mo <simon.mo@hey.com>	2024-02-29 22:13:08 +00:00
Seonghyeon	bfdcfa6a05	Support starcoder2 architecture (#3089 )	2024-02-29 00:51:48 -08:00
Woosuk Kwon	929b4f2973	Add LoRA support for Gemma (#3050 )	2024-02-28 13:03:28 -08:00
Liangfu Chen	3b7178cfa4	[Neuron] Support inference with transformers-neuronx (#2569 )	2024-02-28 09:34:34 -08:00
Tao He	71bcaf99e2	Enable GQA support in the prefix prefill kernels (#3007 ) Signed-off-by: Tao He <sighingnow@gmail.com>	2024-02-27 01:14:31 -08:00
Dylan Hawk	e0ade06d63	Support logit bias for OpenAI API (#3027 )	2024-02-27 11:51:53 +08:00
Jared Moore	70f3e8e3a1	Add LogProbs for Chat Completions in OpenAI (#2918 )	2024-02-26 10:39:34 +08:00
Harry Mellor	ef978fe411	Port metrics from `aioprometheus` to `prometheus_client` (#2730 )	2024-02-25 11:54:00 -08:00
Ronen Schaffer	4caf7044e0	Include tokens from prompt phase in `counter_generation_tokens` (#2802 )	2024-02-22 14:00:12 -08:00
Woosuk Kwon	fd5dcc5c81	Optimize GeGLU layer in Gemma (#2975 )	2024-02-21 20:17:52 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820 )	2024-02-21 18:56:01 -08:00
Nick Hill	7d2dcce175	Support per-request seed (#2514 )	2024-02-21 11:47:00 -08:00
Antoni Baum	017d9f1515	Add metrics to RequestOutput (#2876 )	2024-02-20 21:55:57 -08:00
Zhuohan Li	63e2a6419d	[FIX] Fix beam search test (#2930 )	2024-02-20 14:37:39 -08:00
Ronen Schaffer	e433c115bc	Fix `vllm:prompt_tokens_total` metric calculation (#2869 )	2024-02-18 23:55:41 -08:00
Isotr0py	ab3a5a8259	Support OLMo models. (#2832 )	2024-02-18 21:05:15 -08:00
Zhuohan Li	a61f0521b8	[Test] Add basic correctness test (#2908 )	2024-02-18 16:44:50 -08:00
jvmncs	8f36444c4f	multi-LoRA as extra models in OpenAI server (#2775 ) how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)): ```terminal $ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/ $ python -m vllm.entrypoints.api_server \ --model meta-llama/Llama-2-7b-hf \ --enable-lora \ --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH ``` the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs no work has been done here to scope client permissions to specific models	2024-02-17 12:00:48 -08:00
Woosuk Kwon	d7afab6d3a	[BugFix] Fix GC bug for `LLM` class (#2882 )	2024-02-14 22:17:44 -08:00
Terry	2a543d6efe	Add LoRA support for Mixtral (#2831 ) * add mixtral lora support * formatting * fix incorrectly ported logic * polish tests * minor fixes and refactoring * minor fixes * formatting * rename and remove redundant logic * refactoring * refactoring * minor fix * minor refactoring * fix code smell	2024-02-14 00:55:45 +01:00
Lily Liu	fe6d09ae61	[Minor] More fix of test_cache.py CI test failure (#2750 )	2024-02-06 11:38:38 -08:00
Woosuk Kwon	f0d4e14557	Add fused top-K softmax kernel for MoE (#2769 )	2024-02-05 17:38:02 -08:00
Hongxia Yang	56f738ae9b	[ROCm] Fix some kernels failed unit tests (#2498 )	2024-02-05 14:25:36 -08:00
Kunshang Ji	96b6f475dd	Remove hardcoded `device="cuda"` to support more devices (#2503 ) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>	2024-02-01 15:46:39 -08:00
Philipp Moritz	d0d93b92b1	Add unit test for Mixtral MoE layer (#2677 )	2024-01-31 14:34:17 -08:00
Philipp Moritz	89efcf1ce5	[Minor] Fix test_cache.py CI test failure (#2684 )	2024-01-31 10:12:11 -08:00
Vladimir	4f65af0e25	Add swap_blocks unit tests (#2616 )	2024-01-30 09:30:50 -08:00
wangding zeng	5d60def02c	DeepseekMoE support with Fused MoE kernel (#2453 ) Co-authored-by: roy <jasonailu87@gmail.com>	2024-01-29 21:19:48 -08:00
zhaoyang-star	9090bf02e7	Support FP8-E5M2 KV Cache (#2279 ) Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-01-28 16:43:54 -08:00
Hanzhi Zhou	380170038e	Implement custom all reduce kernels (#2192 )	2024-01-27 12:46:35 -08:00
Simon Mo	3a7dd7e367	Support Batch Completion in Server (#2529 )	2024-01-24 17:11:07 -08:00
Nikola Borisov	3209b49033	[Bugfix] fix crash if max_tokens=None (#2570 )	2024-01-23 22:38:55 -08:00
Antoni Baum	9b945daaf1	[Experimental] Add multi-LoRA support (#1804 ) Co-authored-by: Chen Shen <scv119@gmail.com> Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com> Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-01-23 15:26:37 -08:00
Jason Zhu	7a0b011dd5	Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553 )	2024-01-22 14:47:25 -08:00
Cade Daniel	18bfcdd05c	[Speculative decoding 2/9] Multi-step worker for draft model (#2424 )	2024-01-21 16:31:47 -08:00
Zhuohan Li	ef9b636e2d	Simplify broadcast logic for control messages (#2501 )	2024-01-19 11:23:30 -08:00
Simon Mo	dd7e8f5f64	refactor complemention api for readability (#2499 )	2024-01-18 16:45:14 -08:00
shiyi.c_98	d10f8e1d43	[Experimental] Prefix Caching Support (#1669 ) Co-authored-by: DouHappy <2278958187@qq.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-01-17 16:32:10 -08:00
FlorianJoncour	14cc317ba4	OpenAI Server refactoring (#2360 )	2024-01-16 21:33:14 -08:00
Hyunsung Lee	e1957c6ebd	Add StableLM3B model (#2372 )	2024-01-16 20:32:40 -08:00
Simon Mo	6e01e8c1c8	[CI] Add Buildkite (#2355 )	2024-01-14 12:37:58 -08:00
陈序	218dc2ccda	Aligning `top_p` and `top_k` Sampling (#1885 ) * Align top_p and top_k with huggingface * remove _get_prompt_and_output_tokens * rename _apply_top_p_top_k * compare top_p top_k with hf * fix test errors	2024-01-12 22:51:03 +01:00
Cade Daniel	79d64c4954	[Speculative decoding 1/9] Optimized rejection sampler (#2336 )	2024-01-09 15:38:41 -08:00
Woosuk Kwon	941767127c	Revert the changes in test_cache (#2335 )	2024-01-03 17:32:05 -08:00
Zhuohan Li	fd4ea8ef5c	Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221 )	2024-01-03 11:30:22 -08:00
Jee Li	77af974b40	[FIX] Support non-zero CUDA devices in custom kernels (#1959 )	2024-01-02 19:09:59 -08:00
Zhuohan Li	358c328d69	[BUGFIX] Fix communication test (#2285 )	2023-12-27 17:18:11 -05:00
Zhuohan Li	4aaafdd289	[BUGFIX] Fix the path of test prompts (#2273 )	2023-12-26 10:37:21 -08:00
Zhuohan Li	66b108d142	[BUGFIX] Fix API server test (#2270 )	2023-12-26 10:37:06 -08:00

1 2 3 4

163 Commits