Roy
f1c0fc3919
Migrate logits computation and gather to model_runner ( #3233 )
2024-03-20 23:25:01 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM ( #2868 )
2024-02-21 18:25:05 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 ( #2861 )
2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM ( #2860 )
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
This reverts commit 5c976a7e1a1bec875bf6474824b7dff39e38de18.
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead ( #2221 )
2024-01-03 11:30:22 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Woosuk Kwon
24cde76a15
[Minor] Add comment on skipping rope caches ( #2004 )
2023-12-10 10:04:12 -08:00
Jun Gao
3a8c2381f7
Fix for KeyError on Loading LLaMA ( #1978 )
2023-12-09 15:59:57 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata ( #1843 )
2023-11-29 22:16:37 -08:00
Woosuk Kwon
a9e4574261
Refactor Attention ( #1840 )
2023-11-29 15:37:31 -08:00
Woosuk Kwon
7c600440f7
Fix model docstrings ( #1764 )
2023-11-23 23:04:44 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
ljss
e1054247ba
[Optimization] Implement fused add rmsnorm ( #1667 )
2023-11-18 18:18:02 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
Refactor the tensor parallelism, quantization, and weight-loading code.
Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic ( #1181 )
2023-10-02 15:36:09 -07:00
Zhuohan Li
a60b353005
Support sharding llama2-70b on more than 8 GPUs ( #1209 )
Co-authored-by: JiCheng <247153481@qq.com>
2023-10-02 15:26:33 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling ( #555 )
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config ( #1096 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Woosuk Kwon
cc796b1358
Convert before transpose ( #1073 )
2023-09-18 11:51:48 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA ( #1032 )
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Jasmond L
ab019eea75
Add Model Revision Support ( #1014 )
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models ( #974 )
2023-09-07 15:49:52 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate ( #857 )
2023-09-04 17:29:42 -07:00
JFDuan
0d93f15694
Accelerate LLaMA model loading ( #234 )
2023-08-30 01:00:13 -07:00
Antoni Baum
4b6f069b6f
Add support for CodeLlama ( #854 )
2023-08-25 12:44:07 -07:00
Zhuohan Li
6fc2a38b11
Add support for LLaMA-2 ( #505 )
2023-07-20 11:38:27 -07:00
panda
7b6ae94059
Add vocab padding for LLaMA (support WizardLM) ( #411 )
2023-07-13 23:56:22 -04:00
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter ( #326 )
2023-07-03 11:31:55 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00