1190 Commits

Author SHA1 Message Date
Harry Mellor
682789d402
Fix missing docs and out of sync EngineArgs (#4219)
Co-authored-by: Harry Mellor <hmellor@oxts.com>
2024-04-19 20:51:33 -07:00
Ayush Rautwar
138485a82d
[Bugfix] Add fix for JSON whitespace (#4189)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Chirag Jain
bc9df1571b
Pass tokenizer_revision when getting tokenizer in openai serving (#4214) 2024-04-19 17:13:56 -07:00
youkaichao
15b86408a8
[Misc] add nccl in collect env (#4211) 2024-04-19 19:44:51 +00:00
Ronen Schaffer
7be4f5628f
[Bugfix][Core] Restore logging of stats in the async engine (#4150) 2024-04-19 08:08:26 -07:00
Uranus
8f20fc04bf
[Misc] fix docstrings (#4191)
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com>
2024-04-19 08:18:33 +00:00
Simon Mo
221d93ecbf
Bump version of 0.4.1 (#4177) 2024-04-19 01:00:22 -07:00
Jee Li
d17c8477f1
[Bugfix] Fix LoRA loading check (#4138)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
Simon Mo
a134ef6f5e
Support eos_token_id from generation_config.json (#4182) 2024-04-19 04:13:36 +00:00
youkaichao
8a7a3e4436
[Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
Adam Tilghman
8f9c28fd40
[Bugfix] Fix CustomAllreduce nvlink topology detection (#3974)
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) (#4159)
2024-04-18 15:32:47 -07:00
Liangfu Chen
cd2f63fb36
[CI/CD] add neuron docker and ci test scripts (#3571) 2024-04-18 15:26:01 -07:00
Nick Hill
87fa80c91f
[Misc] Bump transformers to latest version (#4176) 2024-04-18 14:36:39 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) 2024-04-18 21:12:55 +00:00
Simon Mo
705578ae14
[Docs] document that Meta Llama 3 is supported (#4175) 2024-04-18 10:55:48 -07:00
Michał Moskal
e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) 2024-04-18 00:51:28 -07:00
Michael Goin
53b018edcb
[Bugfix] Get available quantization methods from quantization registry (#4098) 2024-04-18 00:21:55 -07:00
Harry Mellor
66ded03067
Allow model to be served under multiple names (#2894)
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
2024-04-18 00:16:26 -07:00
youkaichao
6dc1fc9cfe
[Core] nccl integrity check and test (#4155)
[Core] Add integrity check during initialization; add test for it (#4155)
2024-04-17 22:28:52 -07:00
SangBin Cho
533d2a1f39
[Typing] Mypy typing part 2 (#4043)
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
2024-04-17 17:28:43 -07:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) 2024-04-17 10:02:45 -07:00
Elinx
fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend (#4137)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-17 11:07:23 +00:00
youkaichao
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024)
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024)
2024-04-17 08:34:33 +00:00
Cade Daniel
11d652bd4f
[CI] Move CPU/AMD tests to after wait (#4123) 2024-04-16 22:53:26 -07:00
Cade Daniel
d150e4f89f
[Misc] [CI] Fix CI failure caught after merge (#4126) 2024-04-16 17:56:01 -07:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Noam Gat
05434764cd
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
SangBin Cho
4e7ee664e2
[Core] Fix engine-use-ray broken (#4105) 2024-04-16 05:24:53 +00:00
SangBin Cho
37e84a403d
[Typing] Fix Sequence type GenericAlias only available after Python 3.9. (#4092) 2024-04-15 14:47:31 -07:00
Ricky Xu
4695397dcf
[Bugfix] Fix ray workers profiling with nsight (#4095) 2024-04-15 14:24:45 -07:00
Sanger Steel
d619ae2d19
[Doc] Add better clarity for tensorizer usage (#4090)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-15 13:28:25 -07:00
Nick Hill
eb46fbfda2
[Core] Simplifications to executor classes (#4071) 2024-04-15 13:05:09 -07:00
Li, Jiang
0003e9154b
[Misc][Minor] Fix CPU block num log in CPUExecutor. (#4088) 2024-04-15 08:35:55 -07:00
Zhuohan Li
e11e200736
[Bugfix] Fix filelock version requirement (#4075) 2024-04-14 21:50:08 -07:00
Roy
8db1bf32f8
[Misc] Upgrade triton to 2.2.0 (#4061) 2024-04-14 17:43:54 -07:00
Simon Mo
aceb17cf2d
[Docs] document that mixtral 8x22b is supported (#4073) 2024-04-14 14:35:55 -07:00
Nick Hill
563c54f760
[BugFix] Fix tensorizer extra in setup.py (#4072) 2024-04-14 14:12:42 -07:00
youkaichao
2cd6b4f362
[Core] avoid too many cuda context by caching p2p test (#4021) 2024-04-13 23:40:21 -07:00
Sanger Steel
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer (#3476) 2024-04-13 17:13:01 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B (#4053) 2024-04-13 07:55:05 -07:00
zspo
0a430b4ae2
[Bugfix] fix_small_bug_in_neuron_executor (#4051) 2024-04-13 07:54:03 -07:00
zspo
ec8e3c695f
[Bugfix] fix_log_time_in_metrics (#4050) 2024-04-13 07:52:36 -07:00
youkaichao
98afde19fc
[Core][Distributed] improve logging for init dist (#4042) 2024-04-13 07:12:53 -07:00
Dylan Hawk
5c2e66e487
[Bugfix] More type hint fixes for py 3.8 (#4039) 2024-04-12 21:07:04 -07:00
youkaichao
546e721168
[CI/Test] expand ruff and yapf for all supported python version (#4037) 2024-04-13 01:43:37 +00:00
Jee Li
b8aacac31a
[Bugfix] Fix LoRA bug (#4032) 2024-04-12 16:56:37 -07:00
Bellk17
d04973ad54
Fix triton compilation issue (#3984)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-12 16:41:26 -07:00
youkaichao
fbb9d9eef4
[Core] fix custom allreduce default value (#4040) 2024-04-12 16:40:39 -07:00
SangBin Cho
09473ee41c
[mypy] Add mypy type annotation part 1 (#4006) 2024-04-12 14:35:50 -07:00