| Author | Commit | Message | Date |
|---|---|---|---|
| Tyler Michael Smith | 02cc3b51a7 | [misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263) | 2024-06-05 10:17:51 -07:00 |
| Simon Mo | d5b1eb081e | [CI] Add nightly benchmarks (#5260) | 2024-06-05 09:42:08 -07:00 |
| tomeras91 | f0a500545f | [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) (#5278) | 2024-06-05 09:32:58 -07:00 |
| Woosuk Kwon | c65146e75e | [Misc] Fix docstring of get_attn_backend (#5271) | 2024-06-05 09:18:59 -07:00 |
| Woosuk Kwon | 41ca62cf03 | [Misc] Add CustomOp interface for device portability (#5255) | 2024-06-05 09:18:19 -07:00 |
| zifeitong | 974fc9b845 | [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) | 2024-06-04 19:37:28 -07:00 |
| youkaichao | fee4dcc33a | [Misc] update collect env (#5261) | 2024-06-04 17:29:09 -05:00 |
| Michael Goin | 650a4cc55e | [Misc] Add transformers version to collect_env.py (#5259) | 2024-06-04 12:52:28 -07:00 |
| Simon Mo | 9ca62d8668 | [CI] mark AMD test as softfail to prevent blockage (#5256) | 2024-06-04 11:34:53 -07:00 |
| Li, Jiang | 45c35f0d58 | [CI/Build] Reducing CPU CI execution time (#5241) | 2024-06-04 10:26:40 -07:00 |
| Cyrus Leung | 9ba093b4f4 | [CI/Build] Simplify model loading for HfRunner (#5251) | 2024-06-04 10:09:19 -07:00 |
| Woosuk Kwon | 27208be66e | [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) | 2024-06-04 09:58:47 -07:00 |
| Jie Fu (傅杰) | 87d5abef75 | [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend (#5249) | 2024-06-04 09:57:51 -07:00 |
| Cyrus Leung | ec784b2526 | [CI/Build] Add inputs tests (#5215) | 2024-06-03 21:01:46 -07:00 |
| zifeitong | a58f24e590 | [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) | 2024-06-03 20:55:50 -07:00 |
| afeldman-nm | f42a006b15 | [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) | 2024-06-03 20:32:57 -07:00 |
| Woosuk Kwon | 3a434b07ed | [Kernel] Enhance MoE benchmarking & tuning script (#4921) | 2024-06-03 20:06:59 -07:00 |
| Zhuohan Li | bd0e7802e0 | [Bugfix] Add warmup for prefix caching example (#5235) | 2024-06-03 19:36:41 -07:00 |
| Toshiki Kataoka | 06b2550cbb | [Bugfix] Support prompt_logprobs==0 (#5217) | 2024-06-03 17:59:30 -07:00 |
| Breno Faria | f775a07e30 | [FRONTEND] OpenAI tools support named functions (#5032) | 2024-06-03 18:25:29 -05:00 |
| Kevin H. Luu | 4f0d17c05c | New CI template on AWS stack (#5110)<br>Signed-off-by: kevin <kevin@anyscale.com> | 2024-06-03 16:16:43 -07:00 |
| Kaiyang Chen | 10c38e3e46 | [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) | 2024-06-03 13:37:11 -07:00 |
| Yuan | cafb8e06c5 | [CI/BUILD] enable intel queue for longer CPU tests (#4113) | 2024-06-03 10:39:50 -07:00 |
| Tyler Michael Smith | cbb2f59cc8 | [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) | 2024-06-03 09:52:30 -07:00 |
| Antoni Baum | 0ab278ca31 | [Core] Remove unnecessary copies in flash attn backend (#5138) | 2024-06-03 09:39:31 -07:00 |
| Cyrus Leung | 7a64d24aad | [Core] Support image processor (#4197) | 2024-06-02 22:56:41 -07:00 |
| Cyrus Leung | dfbe60dc62 | [Misc] Simplify code and fix type annotations in conftest.py (#5118) | 2024-06-02 16:05:50 -07:00 |
| Divakar Verma | a66cf40b20 | [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927)<br>This PR enables the fused topk_softmax kernel used in moe layer for HIP | 2024-06-02 14:13:26 -07:00 |
| Avinash Raj | f790ad3c50 | [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643) | 2024-06-02 08:06:13 +00:00 |
| Simon Mo | ed59a7ed23 | Update test_ignore_eos (#4898) | 2024-06-02 02:21:53 +00:00 |
| Robert Shaw | 044793d8df | [BugFix] Prevent LLM.encode for non-generation Models (#5184)<br>Co-authored-by: mgoin <michael@neuralmagic.com> | 2024-06-01 23:35:41 +00:00 |
| Daniil Arapov | c2d6d2f960 | [Bugfix]: Fix issues related to prefix caching example (#5177) (#5180) | 2024-06-01 15:53:52 -07:00 |
| Zhuohan Li | 8279078e21 | [Bugfix] Remove deprecated @abstractproperty (#5174) | 2024-06-01 22:40:25 +00:00 |
| chenqianfzh | b9c0605a8e | [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) | 2024-06-01 14:51:10 -06:00 |
| Nadav Shmayovits | 37464a0f74 | [Bugfix] Fix call to init_logger in openai server (#4765) | 2024-06-01 17:18:50 +00:00 |
| Ye Cao | c354072828 | [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151)<br>Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com> | 2024-06-01 17:11:22 +00:00 |
| Varun Sundar Rabindranath | f081c3ce4b | [Kernel] Update Cutlass fp8 configs (#5144)<br>Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com><br>Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> | 2024-06-01 08:46:07 +00:00 |
| Tyler Michael Smith | 260d119e86 | [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) | 2024-06-01 06:45:32 +00:00 |
| Daniele | a360ff80bb | [CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034) | 2024-05-31 22:06:45 -06:00 |
| Tyler Michael Smith | 1197e02141 | [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) | 2024-05-31 17:21:38 -07:00 |
| Nick Hill | 657579113f | [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171) | 2024-05-31 17:20:19 -07:00 |
| Cody Yu | e9899fb7a4 | [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) | 2024-05-31 14:29:19 -07:00 |
| functionxu123 | a377f0bd5e | [Misc]: optimize eager mode host time (#4196)<br>Co-authored-by: xuhao <xuhao@cambricon.com> | 2024-05-31 13:14:50 +08:00 |
| Simon Mo | e9d3aa04f6 | Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149) | 2024-05-30 22:00:26 -07:00 |
| SnowDist | a22dea54d3 | [Model] Support MAP-NEO model (#5081)<br>Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> | 2024-05-30 19:24:41 -07:00 |
| simon-mo | 533c217792 | Fix cutlass sm_90a vesrion in CMakeList | 2024-05-31 02:13:01 +00:00 |
| Alexander Matveev | 6d21fa1cad | [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136) | 2024-05-30 21:02:11 -05:00 |
| Robert Shaw | b35be5403f | [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) | 2024-05-30 17:04:37 -07:00 |
| Simon Mo | 45a1a69b98 | [Build] Disable sm_90a in cu11 (#5141) | 2024-05-30 14:37:16 -07:00 |
| Simon Mo | 87a658c812 | Bump version to v0.4.3 (#5046) | 2024-05-30 11:13:46 -07:00 |