Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight ( #3162 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
cloudhan
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform ( #3104 )
2024-03-02 14:34:48 +08:00
Allen.Dou
29e70e3e88
Allow user to choose log level via --log-level instead of fixed 'info'. ( #3109 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-01 23:28:41 +00:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
Allen.Dou
9289e577ec
Add cache_config info to Prometheus metrics. ( #3100 )
2024-02-29 06:15:18 +00:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
zhaoyang-star
57f044945f
Fix nvcc not found in vllm-openai image ( #2781 )
2024-02-22 14:25:07 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub ( #2892 )
2024-02-17 22:36:53 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default ( #2808 )
2024-02-08 09:58:03 -08:00
liuyhwangyh
ed70c70ea3
modelscope: fix issue when the model parameter is not a model ID but a path to the model. ( #2489 )
2024-02-06 09:57:15 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zspo
c664b0e683
Fix some bugs ( #2689 )
2024-01-31 10:09:23 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Xiang Xu
220a47627b
Use head_dim in config if it exists ( #2622 )
2024-01-27 10:30:49 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Woosuk Kwon
6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM ( #2318 )
2024-01-03 09:52:29 -08:00
kliuae
1b7c791d60
[ROCm] Fixes for GPTQ on ROCm ( #2180 )
2023-12-18 10:41:04 -08:00
Woosuk Kwon
6f41f0e377
Disable CUDA graph for SqueezeLLM ( #2161 )
2023-12-17 10:24:25 -08:00
Woosuk Kwon
3a765bd5e1
Temporarily enforce eager mode for GPTQ models ( #2154 )
2023-12-17 01:51:12 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
Roy
eed74a558f
Simplify weight loading logic ( #2133 )
2023-12-16 12:41:23 -08:00
Woosuk Kwon
2acd76f346
[ROCm] Temporarily remove GPTQ ROCm support ( #2138 )
2023-12-15 17:13:58 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Antoni Baum
21d93c140d
Optimize Mixtral with expert parallelism ( #2090 )
2023-12-13 23:55:07 -08:00
Woosuk Kwon
b9bcdc7158
Change the load format to pt for Mixtral ( #2028 )
2023-12-11 10:32:17 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main ( #1836 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata ( #1843 )
2023-11-29 22:16:37 -08:00
boydfd
4bb6b67188
Fix RAM OOM when loading large models in tensor parallel mode. ( #1395 )
...
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Woosuk Kwon
be66d9b125
Fix warning message on quantization ( #1715 )
2023-11-18 21:49:55 -08:00
liuyhwangyh
edb305584b
Support downloading models from www.modelscope.cn ( #1588 )
2023-11-17 20:38:31 -08:00
Woosuk Kwon
bb00f66e19
Use quantization_config in hf config ( #1695 )
2023-11-17 16:23:49 -08:00
Aaron Pham
65ea2ddf17
feat(config): support parsing torch.dtype ( #1641 )
...
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2023-11-16 01:31:06 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
...
Refactor the tensor parallelism, quantization, and weight-loading code.
Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code is now much simpler.
- Model parallelism is supported for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
Sin
0d578228ca
config parser: add ChatGLM2 seq_length to _get_and_verify_max_len ( #1617 )
2023-11-09 19:29:51 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support ( #1261 )
2023-11-06 16:09:33 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models ( #1264 )
...
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
...
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & batched top-k for computing logprobs ( #1328 )
...
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Antoni Baum
ee92b58b3a
Move bfloat16 check to worker ( #1259 )
2023-10-07 22:10:44 -07:00
Federico Cassano
66d18a7fb0
Add support for tokenizer revision ( #1163 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length ( #1224 )
2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support ( #1196 )
...
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Woosuk Kwon
a19bc5c628
Automatically configure max_num_batched_tokens ( #1198 )
2023-09-27 16:34:00 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling ( #555 )
...
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
Woosuk Kwon
9f6be8692e
Fix config for Falcon ( #1164 )
2023-09-23 17:38:43 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config ( #1096 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00