Nicolò Lucchesi
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs
with ChunkedPrefill ( #10132 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
2025-01-27 13:38:35 -08:00
Murali Andoorveedu
db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill ( #10232 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
wangshuai09
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Lily Liu
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-11 14:07:34 -07:00
Thomas Parnell
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply ( #6327 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Sirej Dua
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) ( #6050 )
...
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00