20231088/vllm - vllm - Luminance Code Repo

20231088/vllm

Author	SHA1	Message	Date
Jie Fu (傅杰)	cd9c0d65d9	[Hardware][Intel] Support CPU inference with AVX2 ISA (#5452 )	2024-06-13 17:22:24 -06:00
Antoni Baum	50eed24d25	Add `cuda_device_count_stateless` (#5473 )	2024-06-13 16:06:49 -07:00
Tyler Michael Smith	e38042d4af	[Kernel] Disable CUTLASS kernels for fp8 (#5505 )	2024-06-13 13:38:05 -07:00
Tyler Michael Smith	33e3b37242	[CI/Build] Disable test_fp8.py (#5508 )	2024-06-13 13:37:48 -07:00
youkaichao	1696efe6c9	[misc] fix format.sh (#5511 )	2024-06-13 12:09:16 -07:00
Antoni Baum	6b0511a57b	Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478 )	2024-06-13 11:22:50 -07:00
Antoni Baum	a8fda4f661	Seperate dev requirements into lint and test (#5474 )	2024-06-13 11:22:41 -07:00
Cody Yu	30299a41fa	[MISC] Remove FP8 warning (#5472 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2024-06-13 11:22:30 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Cyrus Leung	0ce7b952f8	[Doc] Update LLaVA docs (#5437 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-13 11:22:07 -07:00
Cyrus Leung	39873476f8	[CI/Build] Simplify OpenAI server setup in tests (#5100 )	2024-06-13 11:21:53 -07:00
Cyrus Leung	03dccc886e	[Misc] Add vLLM version getter to utils (#5098 )	2024-06-13 11:21:39 -07:00
Woosuk Kwon	a65634d3ae	[Docs] Add 4th meetup slides (#5509 )	2024-06-13 10:18:26 -07:00
Li, Jiang	80aa7e91fc	[Hardware][Intel] Optimize CPU backend and add more performance tips (#4971 ) Co-authored-by: Jianan Gu <jianan.gu@intel.com>	2024-06-13 09:33:14 -07:00
wenyujin333	bd43973522	[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497 ) Tune Qwen2-57B-A14B configs based on #4921 Throughput Performance command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2 A100 GPU benchmark no config w/ PR tp=2 10.53 requests/s, 11058.17 tokens/s 12.47 requests/s, 13088.57 tokens/s tp=4 17.77 requests/s, 18662.95 tokens/s 20.20 requests/s, 21212.32 tokens/s	2024-06-13 09:01:10 -07:00
Michael Goin	23ec72fa03	[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466 )	2024-06-13 15:18:08 +00:00
Dipika Sikka	c2637a613b	[Kernel] `w4a16` support for `compressed-tensors` (#5385 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 10:19:56 -04:00
Wang, Yi	88407532e7	[Bugfix]if the content is started with ":"(response of ping), client should i… (#5303 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 20:16:41 -07:00
Kevin H. Luu	916d219d62	[ci] Use sccache to build images (#5419 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-12 17:58:12 -07:00
youkaichao	ea3890a5f0	[Core][Distributed] code deduplication in tp&pp with coordinator(#5293 ) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)	2024-06-12 17:27:08 -07:00
Isotr0py	2135cacb45	[Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451 )	2024-06-12 16:20:18 -07:00
Michael Goin	7d19de2e9c	[Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425 )	2024-06-12 18:42:12 -04:00
Michael Goin	94a07bbdd8	[Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470 )	2024-06-12 21:59:44 +00:00
Cyrus Leung	b8d4dfff9c	[Doc] Update debug docs (#5438 )	2024-06-12 14:49:31 -07:00
youkaichao	622d45128c	[misc] add hint for AttributeError (#5462 )	2024-06-12 21:46:35 +00:00
Travis Johnson	51602eefd3	[Frontend] [Core] Support for sharded tensorized models (#4990 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 14:13:52 -07:00
Arthur Kim	5cc50a531f	[Bugfix] TYPE_CHECKING for MultiModalData (#5444 )	2024-06-12 14:08:52 -07:00
Cody Yu	5985e3427d	[Kernel] Vectorized FP8 quantize kernel (#5396 ) Inspired by #5146, this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large). In details, we applied 3 optimizations: - Use inverted scale so that most divisions are changed to multiplications. - Unroll the loop by 4 times to improve ILP. - Use vectorized 4 to transfer data between HBM and SRAM.	2024-06-12 14:07:26 -07:00
Kevin H. Luu	8b82a89997	[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests (#5464 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-12 14:00:18 -07:00
Li, Jiang	c3c2903e72	[Bugfix] Add device assertion to TorchSDPA (#5402 )	2024-06-12 12:58:53 -07:00
Woosuk Kwon	1a8bfd92d5	[Hardware] Initial TPU integration (#5292 )	2024-06-12 11:53:03 -07:00
SangBin Cho	847cdcca1c	[CI] Upgrade codespell version. (#5381 )	2024-06-12 10:06:14 -07:00
Simon Mo	e3c12bf6d2	Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463 )	2024-06-12 10:03:24 -07:00
Michael Goin	3dd6853bc8	[CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253 )	2024-06-12 09:58:02 -07:00
youkaichao	8f89d72090	[Doc] add common case for long waiting time (#5430 )	2024-06-11 11:12:13 -07:00
Nick Hill	99dac099ab	[Core][Doc] Default to multiprocessing for single-node distributed case (#5230 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-11 11:10:41 -07:00
youkaichao	c4bd03c7c5	[Core][Distributed] add same-node detection (#5369 )	2024-06-11 10:53:59 -07:00
sasha0552	dcbf4286af	[Frontend] Customizable RoPE theta (#5197 )	2024-06-11 10:42:26 -07:00
Ali Panahi	00e6a2dc53	[Bugfix] fix lora_dtype value type in arg_utils.py (#5398 )	2024-06-11 10:40:23 -07:00
Junichi Sato	2e02311a1b	[Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 (#5254 )	2024-06-11 10:38:07 -07:00
Cade Daniel	89ec06c33b	[Docs] [Spec decode] Fix docs error in code example (#5427 )	2024-06-11 10:31:56 -07:00
Kuntai Du	9fde251bf0	[Doc] Add an automatic prefix caching section in vllm documentation (#5324 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-06-11 10:24:59 -07:00
Cade Daniel	4c2ffb28ff	[Speculative decoding] Initial spec decode docs (#5400 )	2024-06-11 10:15:40 -07:00
SangBin Cho	246598a6b1	[CI] docfix (#5410 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: ywang96 <ywang@roblox.com>	2024-06-11 01:28:50 -07:00
Woosuk Kwon	8bab4959be	[Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389 )	2024-06-11 00:37:56 -07:00
Roger Wang	3c4cebf751	[Doc][Typo] Fixing Missing Comma (#5403 )	2024-06-11 00:20:28 -07:00
youkaichao	d8f31f2f8b	[Doc] add debugging tips (#5409 )	2024-06-10 23:21:43 -07:00
Cyrus Leung	640052b069	[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026 )	2024-06-10 22:36:46 -07:00
maor-ps	351d5e7b82	[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-11 10:30:31 +08:00
Nick Hill	a008629807	[Misc] Various simplifications and typing fixes (#5368 )	2024-06-11 10:29:02 +08:00

1 2 3 4 5 ...

1588 Commits