20231088/vllm - vllm - Luminance Code Repo

Author	SHA1	Message	Date
Dinghow Yang	253a98078a	Add chat templates for ChatGLM (#3418 )	2024-03-14 23:19:22 -07:00
Dinghow Yang	21539e6856	Add chat templates for Falcon (#3420 )	2024-03-14 23:19:02 -07:00
youkaichao	b522c4476f	[Misc] add HOST_IP env var (#3419 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-03-14 21:32:52 -07:00
akhoroshev	78b6c4845a	Dynamically configure shared memory size for moe_align_block_size_kernel (#3376 )	2024-03-14 18:18:07 -07:00
Enrique Shockwave	b983ba35bd	fix marlin config repr (#3414 )	2024-03-14 16:26:19 -07:00
陈序	54be8a0be2	Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-03-14 13:56:57 -07:00
youkaichao	dfc77408bd	[issue templates] add some issue templates (#3412 )	2024-03-14 13:16:00 -07:00
Dan Clark	c17ca8ef18	Add args for mTLS support (#3410 ) Co-authored-by: Daniel Clark <daniel.clark@ibm.com>	2024-03-14 13:11:45 -07:00
Thomas Parnell	06ec486794	Install `flash_attn` in Docker image (#3396 )	2024-03-14 10:55:54 -07:00
youkaichao	8fe8386591	[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389 )	2024-03-14 08:11:48 +00:00
Allen.Dou	a37415c31b	allow user to choose which vllm's metrics to display in grafana (#3393 )	2024-03-14 06:35:13 +00:00
Simon Mo	81653d9688	[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion (#3383 )	2024-03-13 17:02:21 -07:00
Zhuohan Li	eeab52a4ff	[FIX] Simpler fix for async engine running on ray (#3371 )	2024-03-13 14:18:40 -07:00
Antoni Baum	c33afd89f5	Fix lint (#3388 )	2024-03-13 13:56:49 -07:00
Terry	7e9bd08f60	Add batched RoPE kernel (#3095 )	2024-03-13 13:45:26 -07:00
Or Sharir	ae0ccb4017	Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350 )	2024-03-13 12:18:25 -07:00
陈序	739c350c19	[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256 )	2024-03-13 09:43:24 -07:00
Hui Liu	ba8dc958a3	[Minor] Fix bias in if to remove ambiguity (#3259 )	2024-03-13 09:16:55 -07:00
Ronan McGovern	e221910e77	add hf_transfer to requirements.txt (#3031 )	2024-03-12 23:33:43 -07:00
Bo-Wen Wang	b167109ba1	[Fix] Fix quantization="gptq" when using Marlin (#3319 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-03-12 22:51:42 -07:00
Woosuk Kwon	602358f8a8	Add kernel for GeGLU with approximate GELU (#3337 )	2024-03-12 22:06:17 -07:00
Breno Faria	49a3c8662b	Fixes #1556 double free (#3347 )	2024-03-13 00:30:08 +00:00
Sherlock Xu	b0925b3878	docs: Add BentoML deployment doc (#3336 ) Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>	2024-03-12 10:34:30 -07:00
DAIZHENWEI	654865e21d	Support Mistral Model Inference with transformers-neuronx (#3153 )	2024-03-11 13:19:51 -07:00
kliuae	c9415c19d3	[ROCm] Fix warp and lane calculation in blockReduceSum (#3321 )	2024-03-11 13:14:07 -07:00
Zhuohan Li	4c922709b6	Add distributed model executor abstraction (#3191 )	2024-03-11 11:03:45 -07:00
Philipp Moritz	657061fdce	[docs] Add LoRA support information for models (#3299 )	2024-03-11 00:54:51 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305 )	2024-03-10 19:49:14 -07:00
Nick Hill	4b59f00e91	[Fix] Fix best_of behavior when n=1 (#3298 )	2024-03-10 19:17:46 -07:00
Roy	9e8744a545	[BugFix] Fix get tokenizer when using ray (#3301 )	2024-03-10 19:17:16 -07:00
Douglas Lehr	e4a28e5316	[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262 )	2024-03-10 15:27:45 -07:00
Terry	0bba88df03	Enhance lora tests with more layer and rank variations (#3243 )	2024-03-09 17:14:16 -08:00
Cade Daniel	8437bae6ef	[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103 )	2024-03-08 23:32:46 -08:00
Zhuohan Li	f48c6791b7	[FIX] Fix prefix test error on main (#3286 )	2024-03-08 17:16:14 -08:00
Michael Goin	c2c5e0909a	Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241 )	2024-03-08 13:33:10 -08:00
Woosuk Kwon	1cb0cc2975	[FIX] Make `flash_attn` optional (#3269 )	2024-03-08 10:52:20 -08:00
Roger Wang	99c3cfb83c	[Docs] Fix Unmocked Imports (#3275 )	2024-03-08 09:58:01 -08:00
TianYu GUO	1ece1ae829	[Minor Fix] Fix comments in benchmark_serving (#3252 )	2024-03-07 22:22:59 -08:00
whyiug	c59e120c55	Feature add lora support for Qwen2 (#3177 )	2024-03-07 21:58:24 -08:00
Nick Hill	d2339d6840	Connect engine healthcheck to openai server (#3260 )	2024-03-07 16:38:12 -08:00
ElizaWszola	b35cc93420	Fix auto prefix bug (#3239 )	2024-03-07 16:37:28 -08:00
jacobthebanana	8cbba4622c	Possible fix for conflict between Automated Prefix Caching (#2762 ) and multi-LoRA support (#1804 ) (#3263 )	2024-03-07 23:03:22 +00:00
Michael Goin	385da2dae2	Measure model memory usage (#3120 )	2024-03-07 11:42:42 -08:00
Woosuk Kwon	2daf23ab0c	Separate attention backends (#3005 )	2024-03-07 01:45:50 -08:00
Chen Wang	cbf4c05b15	Update requirements-dev.txt to include package for benchmarking scripts. (#3181 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-03-07 08:39:28 +00:00
TechxGenus	d3c04b6a39	Add GPTQ support for Gemma (#3200 )	2024-03-07 08:19:14 +08:00
Chujie Zheng	4cb3b924cd	Add tqdm `dynamic_ncols=True` (#3242 )	2024-03-06 22:41:42 +00:00
Cade Daniel	a33ce60c66	[Testing] Fix core tests (#3224 )	2024-03-06 01:04:23 -08:00
SangBin Cho	24aecf421a	[Tests] Add block manager and scheduler tests (#3108 )	2024-03-05 18:23:34 -08:00
Nick Hill	2efce05dc3	[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-03-06 00:17:20 +00:00

877 Commits