171 Commits

Author SHA1 Message Date
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Cyrus Leung
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL (#9250) 2024-10-16 13:56:17 +08:00
Grace Ho
5d264f4ab8
pass ignore_eos parameter to all benchmark_serving calls (#9349) 2024-10-15 13:30:44 -07:00
Andy Dai
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case (#9235) 2024-10-11 00:33:16 +00:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
youkaichao
18b296fdb2
[core] remove beam search from the core (#9105) 2024-10-07 05:47:04 +00:00
Brendan Wong
168cab6bbf
[Frontend] API support for beam search (#9087)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-10-05 23:39:03 -07:00
Cody Yu
27302dd584
[Misc] Fix CI lint (#9085) 2024-10-04 16:07:54 -07:00
Andy Dai
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark (#9081) 2024-10-04 21:58:57 +00:00
Kuntai Du
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412) 2024-10-04 14:01:44 -07:00
vlsav
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997) 2024-10-01 11:07:06 -07:00
Chen Zhang
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891) 2024-09-28 18:51:22 +00:00
Peter Pan
0e088750af
[MISC] Fix invalid escape sequence '\' (#8830)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2024-09-27 01:13:25 -07:00
Cyrus Leung
3b00b9c26c
[Core] rename PromptInputs and inputs (#8876) 2024-09-26 20:35:15 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility (#8760)" (#8810) 2024-09-25 10:36:26 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
Archit Patke
6da1ab6b41
[Core] Adding Priority Scheduling (#5958) 2024-09-24 19:50:50 -07:00
Simon Mo
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" (#8750) 2024-09-24 05:45:20 +00:00
youkaichao
0250dd68c5
re-implement beam search on top of vllm core (#8726)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-09-23 22:08:12 -07:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Cyrus Leung
0057894ef7
[Core] Rename PromptInputs and inputs (#8673) 2024-09-20 19:00:54 -07:00
Kunshang Ji
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py (#8615) 2024-09-18 22:33:20 -07:00
Kuntai Du
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) 2024-09-19 05:24:24 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Isotr0py
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) 2024-09-17 07:34:27 +00:00
Aarni Koskela
8baa454937
[Misc] Move device options to a single place (#8322) 2024-09-11 13:25:58 -07:00
Wei-Sheng Chin
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241) 2024-09-06 20:18:16 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py (#8191) 2024-09-06 09:01:14 -07:00
Cody Yu
77d9e514a2
[MISC] Replace input token throughput with total token throughput (#8164)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-04 20:23:22 +00:00
Nick Hill
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py (#7964) 2024-09-03 20:57:41 -04:00
Wei-Sheng Chin
0c785d344d
Add more percentiles and latencies (#7759) 2024-08-29 16:48:11 -07:00
Philipp Schmid
345be0e244
[benchmark] Update TGI version (#7917) 2024-08-27 15:07:53 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
Jiaxin Shan
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool (#6568) 2024-08-22 09:32:02 -07:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
William Lin
dd53c4b023
[misc] Add Torch profiler support (#7451)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-21 15:39:26 -07:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) 2024-08-20 07:09:33 -06:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Roger Wang
70d268a399
[Bugfix] Fix ITL recording in serving benchmark (#7372) 2024-08-09 10:00:00 -07:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
Varun Sundar Rabindranath
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Joe
14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Woosuk Kwon
a9a2e74d21
[Misc] Use torch.Tensor for type annotation (#6505) 2024-07-17 13:01:10 +00:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00