977 Commits

Author SHA1 Message Date
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Roger Wang
ce567a2926
[Kernel] DBRX Triton MoE kernel H100 (#3692) 2024-03-28 10:05:34 -07:00
wenyujin333
d6ea427f04
[Model] Add support for Qwen2MoeModel (#3346) 2024-03-28 15:19:59 +00:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00
Woosuk Kwon
8267b06c30
[Kernel] Add Triton MoE kernel configs for DBRX on A100 (#3679) 2024-03-27 22:22:25 -07:00
youkaichao
3492859b68
[CI/Build] update default number of jobs and nvcc threads to avoid overloading the system (#3675) 2024-03-28 00:18:54 -04:00
hxer7963
098e1776ba
[Model] Add support for xverse (#3610)
Co-authored-by: willhe <hexin@xverse.cn>
Co-authored-by: root <root@localhost.localdomain>
2024-03-27 18:12:54 -07:00
Roy
10e6322283
[Model] Fix and clean commandr (#3671) 2024-03-28 00:20:00 +00:00
Woosuk Kwon
6d9aa00fc4
[Docs] Add Command-R to supported models (#3669) 2024-03-27 15:20:00 -07:00
zeppombal
1182607e18
Add support for Cohere's Command-R model (#3433)
Co-authored-by: José Maria Pombal <jose.pombal@unbabel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-03-27 14:19:32 -07:00
Roger Wang
45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) 2024-03-27 13:39:26 -07:00
AmadeusChan
1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks (#3621) 2024-03-27 13:39:05 -07:00
Megha Agarwal
e24336b5a7
[Model] Add support for DBRX (#3660) 2024-03-27 13:01:46 -07:00
youkaichao
d18f4e73f3
[Bugfix] [Hotfix] fix nccl library name (#3661) 2024-03-27 17:23:54 +00:00
Woosuk Kwon
82c540bebf
[Bugfix] More faithful implementation of Gemma (#3653) 2024-03-27 09:37:18 -07:00
youkaichao
8f44facddd
[Core] remove cupy dependency (#3625) 2024-03-27 00:33:26 -07:00
Woosuk Kwon
e66b629c04
[Misc] Minor fix in KVCache type (#3652) 2024-03-26 23:14:06 -07:00
Jee Li
76879342a3
[Doc]add lora support (#3649) 2024-03-27 02:06:46 +00:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels (#3636) 2024-03-27 00:37:42 +00:00
Nick Hill
0dc72273b8
[BugFix] Fix ipv4 address parsing regression (#3645) 2024-03-26 14:39:44 -07:00
liiliiliil
a979d9771e
[Bugfix] Fix ipv6 address parsing bug (#3641) 2024-03-26 11:58:20 -07:00
Jee Li
8af890a865
Enable more models to inference based on LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
Antoni Baum
3a243095e5
Optimize _get_ranks in Sampler (#3623) 2024-03-25 16:03:02 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
Simon Mo
f408d05c52
hotfix isort on logprobs ranks pr (#3622) 2024-03-25 11:55:46 -07:00
Dylan Hawk
0b4997e05c
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh
819924e749
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
TianYu GUO
e67c295b0c
[Bugfix] fix automatic prefix args and add log info (#3608) 2024-03-25 05:35:22 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00
少年
b0dfa91dd7
[Model] Add starcoder2 awq support (#3569) 2024-03-24 21:07:36 -07:00
Woosuk Kwon
56a8652f33
Revert "[Bugfix] store lock file in tmp directory (#3578)" (#3599)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-03-24 20:06:50 -07:00
Kunshang Ji
6d93d35308
[BugFix] tensor.get_device() -> tensor.device (#3604) 2024-03-24 19:01:13 -07:00
youkaichao
837e185142
[CI/Build] fix flaky test (#3602) 2024-03-24 17:43:05 -07:00
youkaichao
42bc386129
[CI/Build] respect the common environment variable MAX_JOBS (#3600) 2024-03-24 17:04:00 -07:00
youkaichao
8b268a46a7
[CI] typo fix: is_hip --> is_hip() (#3595) 2024-03-24 16:03:06 -07:00
Nick Hill
41deac4a3d
[BugFix] 1D query fix for MoE models (#3597) 2024-03-24 16:00:16 -07:00
Woosuk Kwon
af9e53496f
[BugFix] Fix Falcon tied embeddings (#3590)
Co-authored-by: 44670 <44670@users.noreply.github.com>
2024-03-24 06:34:01 -07:00
Roger Wang
f8a12ecc7f
[Misc] Bump transformers version (#3592) 2024-03-24 06:32:45 -07:00
Woosuk Kwon
3c5ab9b811
[Misc] Fix BLOOM copyright notice (#3591) 2024-03-23 23:30:56 -07:00
kota-iizuka
743a0b7402
[Bugfix] use SoftLockFile instead of LockFile (#3578) 2024-03-23 11:43:11 -07:00
Antoni Baum
bfdb1ba5c3
[Core] Improve detokenization performance for prefill (#3469)
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Thomas Parnell
cf2f084d56
Dynamic scheduler delay to improve ITL performance (#3279)
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Hanzhi Zhou
f721096d48
[BugFix] Some fixes for custom allreduce kernels (#2760) 2024-03-21 23:02:58 -07:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy
ea5f14e6ff
[Bugfix][Model] Fix Qwen2 (#3554) 2024-03-22 00:18:58 +00:00
Taemin Lee
b7050ca7df
[BugFix] gemma loading after quantization or LoRA. (#3553) 2024-03-21 13:16:57 -07:00
Woosuk Kwon
c188ecb080
[Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (#3551)
Co-authored-by: Roy <jasonailu87@gmail.com>
Co-authored-by: Roger Meier <r.meier@siemens.com>
2024-03-21 07:58:12 -07:00