270 Commits

Author SHA1 Message Date
Antoni Baum
22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
cloudhan
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) 2024-03-02 14:34:48 +08:00
Allen.Dou
29e70e3e88
allow user chose log level by --log-level instead of fixed 'info'. (#3109)
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-01 23:28:41 +00:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
Allen.Dou
9289e577ec
add cache_config's info to prometheus metrics. (#3100) 2024-02-29 06:15:18 +00:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
zhaoyang-star
57f044945f
Fix nvcc not found in vllm-openai image (#2781) 2024-02-22 14:25:07 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub (#2892) 2024-02-17 22:36:53 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default (#2808) 2024-02-08 09:58:03 -08:00
liuyhwangyh
ed70c70ea3
modelscope: fix issue when model parameter is not a model id but path of the model. (#2489) 2024-02-06 09:57:15 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zspo
c664b0e683
fix some bugs (#2689) 2024-01-31 10:09:23 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Xiang Xu
220a47627b
Use head_dim in config if exists (#2622) 2024-01-27 10:30:49 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Woosuk Kwon
6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM (#2318) 2024-01-03 09:52:29 -08:00
kliuae
1b7c791d60
[ROCm] Fixes for GPTQ on ROCm (#2180) 2023-12-18 10:41:04 -08:00
Woosuk Kwon
6f41f0e377
Disable CUDA graph for SqueezeLLM (#2161) 2023-12-17 10:24:25 -08:00
Woosuk Kwon
3a765bd5e1
Temporarily enforce eager mode for GPTQ models (#2154) 2023-12-17 01:51:12 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
Roy
eed74a558f
Simplify weight loading logic (#2133) 2023-12-16 12:41:23 -08:00
Woosuk Kwon
2acd76f346
[ROCm] Temporarily remove GPTQ ROCm support (#2138) 2023-12-15 17:13:58 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
Antoni Baum
21d93c140d
Optimize Mixtral with expert parallelism (#2090) 2023-12-13 23:55:07 -08:00
Woosuk Kwon
b9bcdc7158
Change the load format to pt for Mixtral (#2028) 2023-12-11 10:32:17 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main (#1836)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata (#1843) 2023-11-29 22:16:37 -08:00
boydfd
4bb6b67188
fix RAM OOM when load large models in tensor parallel mode. (#1395)
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Woosuk Kwon
be66d9b125
Fix warning msg on quantization (#1715) 2023-11-18 21:49:55 -08:00
liuyhwangyh
edb305584b
Support download models from www.modelscope.cn (#1588) 2023-11-17 20:38:31 -08:00
Woosuk Kwon
bb00f66e19
Use quantization_config in hf config (#1695) 2023-11-17 16:23:49 -08:00
Aaron Pham
65ea2ddf17
feat(config): support parsing torch.dtype (#1641)
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2023-11-16 01:31:06 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
Sin
0d578228ca
config parser: add ChatGLM2 seq_length to _get_and_verify_max_len (#1617) 2023-11-09 19:29:51 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support (#1261) 2023-11-06 16:09:33 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models (#1264)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
chooper1
1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Antoni Baum
ee92b58b3a
Move bfloat16 check to worker (#1259) 2023-10-07 22:10:44 -07:00
Federico Cassano
66d18a7fb0
add support for tokenizer revision (#1163)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length (#1224) 2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Woosuk Kwon
a19bc5c628
Automatically configure max_num_batched_tokens (#1198) 2023-09-27 16:34:00 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling (#555)
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
Woosuk Kwon
9f6be8692e
Fix config for Falcon (#1164) 2023-09-23 17:38:43 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config (#1096)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00