| Author | Commit | Message | Date |
|---|---|---|---|
| Iskren Ivov Chernev | d0215a58e7 | Ensure metrics are logged regardless of requests (#2347) | 2024-01-05 05:24:42 -08:00 |
| Zhuohan Li | fd4ea8ef5c | Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) | 2024-01-03 11:30:22 -08:00 |
| Zhuohan Li | e0ff920001 | [BUGFIX] Do not return ignored sentences twice in async llm engine (#2258) | 2023-12-26 13:41:09 +08:00 |
| Woosuk Kwon | 3a4fd5ca59 | Disable Ray usage stats collection (#2206) | 2023-12-20 21:52:08 -08:00 |
| Woosuk Kwon | 8041b7305e | [BugFix] Raise error when max_model_len is larger than KV cache (#2163) | 2023-12-17 17:08:23 -08:00 |
| Woosuk Kwon | c3372e87be | Remove dependency on CuPy (#2152) | 2023-12-17 01:49:07 -08:00 |
| Woosuk Kwon | 37ca558103 | Optimize model execution with CUDA graph (#1926); Co-authored-by: Chen Shen <scv119@gmail.com>, Antoni Baum <antoni.baum@protonmail.com> | 2023-12-16 21:12:08 -08:00 |
| Yunfeng Bai | c06170cc8e | Add a flag to include stop string in output text (#1976) | 2023-12-15 00:45:58 -08:00 |
| Woosuk Kwon | 464dd985e3 | Fix num_gpus when TP > 1 (#1852) | 2023-12-03 12:24:30 -08:00 |
| Simon Mo | 5313c2cb8b | Add Production Metrics in Prometheus format (#1890) | 2023-12-02 16:37:44 -08:00 |
| Woosuk Kwon | 27feead2f8 | Refactor Worker & InputMetadata (#1843) | 2023-11-29 22:16:37 -08:00 |
| FlorianJoncour | 0229c386c5 | Better integration with Ray Serve (#1821); Co-authored-by: FlorianJoncour <florian@zetta-sys.com> | 2023-11-29 13:25:43 -08:00 |
| Zhuohan Li | 708e6c18b0 | [FIX] Fix class naming (#1803) | 2023-11-28 14:08:01 -08:00 |
| boydfd | 4bb6b67188 | fix RAM OOM when load large models in tensor parallel mode. (#1395); Co-authored-by: ran_lin <rlin@thoughtworks.com> | 2023-11-20 19:02:42 -08:00 |
| Simon Mo | 5ffc0d13a2 | Migrate linter from pylint to ruff (#1665) | 2023-11-20 11:58:01 -08:00 |
| Simon Mo | cb08cd0d75 | [Minor] Fix duplication of ignored seq group in engine step (#1666) | 2023-11-16 13:11:41 -08:00 |
| Dan Lord | 7013a80170 | Add support for spaces_between_special_tokens | 2023-10-30 16:52:56 -07:00 |
| Zhuohan Li | 9d9072a069 | Implement prompt logprobs & Batched topk for computing logprobs (#1328); Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com> | 2023-10-16 10:56:50 -07:00 |
| Antoni Baum | acbed3ef40 | Use monotonic time where appropriate (#1249) | 2023-10-02 19:22:05 -07:00 |
| Federico Cassano | 66d18a7fb0 | add support for tokenizer revision (#1163); Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> | 2023-10-02 19:19:46 -07:00 |
| Woosuk Kwon | f936657eb6 | Provide default max model length (#1224) | 2023-09-28 14:44:02 -07:00 |
| Chris Bamford | bb1ba58f06 | [Mistral] Mistral-7B-v0.1 support (#1196); Co-authored-by: timlacroix <t@mistral.ai> | 2023-09-28 10:41:03 -07:00 |
| Dan Lord | 20f7cc4cde | Add skip_special_tokens sampling params (#1186) | 2023-09-27 19:21:42 -07:00 |
| Wang Ran (汪然) | 30e775281d | fix typo (#1184); Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> | 2023-09-27 16:22:45 -07:00 |
| Ricardo Lu | f98b745a81 | feat: support stop_token_ids parameter. (#1097) | 2023-09-21 15:34:02 -07:00 |
| 陈序 | e21d7687a9 | Fix hanging when prompt exceeds limit (#1029) | 2023-09-17 01:48:56 -07:00 |
| Woosuk Kwon | e3e79e9e8a | Implement AWQ quantization support for LLaMA (#1032); Co-authored-by: Robert Irvine <robert@seamlessml.com>, root <rirv938@gmail.com>, Casper <casperbh.96@gmail.com>, julian-q <julianhquevedo@gmail.com> | 2023-09-16 00:03:37 -07:00 |
| Jasmond L | ab019eea75 | Add Model Revision Support (#1014); Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>, Zhuohan Li <zhuohan123@gmail.com> | 2023-09-13 15:20:02 -07:00 |
| Antoni Baum | 9841d48a10 | Use TGI-like incremental detokenization (#984) | 2023-09-13 13:38:01 -07:00 |
| Jingru | 4042d192f5 | fix "tansformers_module" ModuleNotFoundError when load model with trust_remote_code=True (#871) | 2023-09-08 17:21:30 -07:00 |
| Zhuohan Li | c957c741d9 | Enable safetensors loading for all models (#974) | 2023-09-07 15:49:52 -07:00 |
| Zhuohan Li | 002800f081 | Align vLLM's beam search implementation with HF generate (#857) | 2023-09-04 17:29:42 -07:00 |
| Antoni Baum | ce741ba3e4 | Refactor AsyncLLMEngine (#880) | 2023-09-03 21:43:43 -07:00 |
| Woosuk Kwon | 55fe8a81ec | Refactor scheduler (#658) | 2023-08-02 16:42:01 -07:00 |
| Chaofan Lin | aa39e42c5a | fix doc (#622) | 2023-07-31 13:11:57 -07:00 |
| Fang li | 953f28cf9a | fix ModuleNotFoundError (#599); Co-authored-by: fangli <fangli@tencent.com> | 2023-07-29 20:52:41 -07:00 |
| Antoni Baum | 9925c17940 | Ray placement group support (#397) | 2023-07-19 22:49:31 -07:00 |
| Lily Liu | b4b195b360 | fix max seq len (#489) | 2023-07-17 23:20:20 -07:00 |
| Zhuohan Li | 2bdea7ac11 | [Fix] Fix the condition of max_seq_len (#477) | 2023-07-17 00:33:48 -04:00 |
| xcnick | c6dfc3cdbe | Fix handling of special tokens in decoding. (#418) | 2023-07-12 11:14:56 -04:00 |
| codethazine | a945fcc2ae | Add trust-remote-code flag to handle remote tokenizers (#364) | 2023-07-07 11:04:58 -07:00 |
| Zhuohan Li | 42e0c1df78 | [Quality] Add CI for formatting (#343) | 2023-07-03 14:50:56 -07:00 |
| Zhuohan Li | d6fa1be3a8 | [Quality] Add code formatter and linter (#326) | 2023-07-03 11:31:55 -07:00 |
| Lily Liu | dafd924c1f | Raise error for long prompt (#273) | 2023-06-30 18:48:49 -07:00 |
| Woosuk Kwon | 998d9d1509 | [Tokenizer] Add tokenizer mode (#298) | 2023-06-28 14:19:22 -07:00 |
| Woosuk Kwon | 4338cc4750 | [Tokenizer] Add an option to specify tokenizer (#284) | 2023-06-28 09:46:58 -07:00 |
| Zhuohan Li | 0b7db411b5 | [Bug] Fix the OOM condition for CPU cache (#260) | 2023-06-26 11:16:13 -07:00 |
| Zhuohan Li | 1d24ccb96c | [Fix] Better error message when there is OOM during cache initialization (#203) | 2023-06-22 15:30:06 +08:00 |
| Zhuohan Li | 2e0d314384 | fix-ray (#193) | 2023-06-22 00:21:41 +08:00 |
| Woosuk Kwon | 0b98ba15c7 | Change the name to vLLM (#150) | 2023-06-17 03:07:40 -07:00 |
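
Several of the commits above introduced user-facing loading and decoding options (trust-remote-code #364, tokenizer mode #298, default/checked max model length #1224 and #2163, stop_token_ids #1097, skip_special_tokens #1186, the stop-string-in-output flag #1976, and spaces_between_special_tokens). The snippet below is a minimal sketch of how those options are typically combined; the exact parameter names and the example model are assumptions based on the vLLM releases these commits shipped in, not text taken from the log itself.

```python
from vllm import LLM, SamplingParams

# Model and tokenizer loading options (#364, #298, #1196, #1224, #2163).
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # Mistral-7B support landed in #1196
    trust_remote_code=True,             # allow remote tokenizer/model code (#364)
    tokenizer_mode="auto",              # tokenizer mode flag (#298)
    max_model_len=4096,                 # must fit within the KV cache (#2163)
)

# Decoding options (#1097, #1186, #1976, spaces_between_special_tokens).
params = SamplingParams(
    temperature=0.8,
    max_tokens=128,
    stop=["###"],                        # stop on a literal string
    stop_token_ids=[2],                  # stop on specific token ids (#1097)
    include_stop_str_in_output=True,     # keep the stop string in the text (#1976)
    skip_special_tokens=True,            # drop special tokens when detokenizing (#1186)
    spaces_between_special_tokens=True,
)

outputs = llm.generate(["San Francisco is"], params)
print(outputs[0].outputs[0].text)
```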