20231088/vllm - vllm - Luminance Code Repo

Author	SHA1	Message	Date
rasmith	e5697d161c	[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)	2024-08-28 15:37:47 -04:00
Patrick von Platen	6fc4e6e07a	[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739)	2024-08-27 12:40:02 +00:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
zifeitong	df1a21131d	[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710)	2024-08-22 09:36:24 +08:00
Nick Hill	9b73a2f498	[Spec Decoding] Use target model max length as default for draft model (#7706)	2024-08-22 00:23:22 +08:00
Ronen Schaffer	2aa00d59ad	[CI/Build] Pin OpenTelemetry versions and make errors clearer (#7266)	2024-08-20 10:02:21 -07:00
SangBin Cho	ff7ec82c4d	[Core] Optimize SPMD architecture with delta + serialization optimization (#7109)	2024-08-18 17:57:20 -07:00
Roger Wang	bbf55c4805	[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530)	2024-08-17 13:30:55 -07:00
Besher Alkurdi	e73f76eec6	[Model] Pipeline parallel support for JAIS (#7603)	2024-08-17 11:11:09 -07:00
Mor Zusman	7fc23be81c	[Kernel] W8A16 Int8 inside FusedMoE (#7415)	2024-08-16 10:06:51 -07:00
Charlie Fu	e837b624f2	[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)	2024-08-16 10:06:30 -07:00
shangmingc	b67ae00cdb	[Misc] Add quantization config support for speculative model. (#7343)	2024-08-15 19:34:28 -07:00
William Lin	2ecf7b1757	[core] [3/N] multi-step args and sequence.py (#7452)	2024-08-14 12:32:45 -07:00
Cyrus Leung	3f674a49b5	[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)	2024-08-14 17:55:42 +00:00
youkaichao	4d2dc5072b	[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102)	2024-08-13 00:16:42 -07:00
jon-chuang	a046f86397	[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-12 22:47:41 +00:00
Cyrus Leung	4ddc4743d7	[Core] Consolidate `GB` constant and enable float GB arguments (#7416)	2024-08-12 14:14:14 -07:00
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)	2024-08-09 13:55:13 -07:00
Cyrus Leung	7eb4a51c5f	[Core] Support serving encoder/decoder models (#7258)	2024-08-09 10:39:41 +08:00
Siyuan Liu	0fa14907da	[TPU] Add Load-time W8A16 quantization for TPU Backend (#7005)	2024-08-08 18:35:49 -07:00
Jee Jee Li	a049b107e2	[Misc] Temporarily resolve the error of BitAndBytes (#7308)	2024-08-08 13:42:58 -07:00
Cherilyn Buren	48abee9e54	[Frontend] remove max_num_batched_tokens limit for lora (#7288)	2024-08-08 06:17:29 +00:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
Jee Jee Li	9118217f58	[LoRA] Relax LoRA condition (#7146)	2024-08-06 01:57:25 +00:00
Isotr0py	360bd67cf0	[Core] Support loading GGUF model (#5191) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-05 17:54:23 -06:00
Cade Daniel	82a1b1a82b	[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963)	2024-08-05 08:46:44 +00:00
Jee Jee Li	f80ab3521c	Clean up remaining Punica C information (#7027)	2024-08-04 15:37:08 -07:00
Thomas Parnell	b1c9aa3daa	[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator (#7105) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-08-04 07:13:18 -07:00
Jeff Fialho	825b044863	[Frontend] Warn if user `max_model_len` is greater than derived `max_model_len` (#7080) Signed-off-by: Jefferson Fialho <jfialho@ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-03 16:01:38 -07:00
Murali Andoorveedu	fc912e0886	[Models] Support Qwen model with PP (#6974) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-08-01 12:40:43 -07:00
xuyi	1d2e7fb73f	[Model] Pipeline parallel support for Qwen2 (#6924)	2024-07-31 18:49:51 -07:00
Michael Goin	a0dce9383a	[Misc] Add compressed-tensors to optimized quant list (#7006)	2024-07-31 14:40:44 -07:00
Cyrus Leung	da1f7cc12a	[mypy] Enable following imports for some directories (#6681)	2024-07-31 10:38:03 +08:00
Michael Goin	b1366a9534	Add Nemotron to PP_SUPPORTED_MODELS (#6863)	2024-07-27 15:05:17 -07:00
chenqianfzh	bb5494676f	enforce eager mode with bnb quantization temporarily (#6846)	2024-07-27 01:32:20 +00:00
dongmao zhang	87525fab92	[bitsandbytes]: support read bnb pre-quantized model (#5753) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-23 23:45:09 +00:00
Travis Johnson	507ef787d8	[Model] Pipeline Parallel Support for DeepSeek v2 (#6519) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-07-23 12:22:09 -07:00
Woosuk Kwon	a112a84aad	[BugFix] Fix RoPE error in Llama 3.1 (#6693)	2024-07-23 09:46:05 -07:00
Woosuk Kwon	461089a21a	[Bugfix] Fix a log error in chunked prefill (#6694)	2024-07-23 09:27:58 -07:00
Simon Mo	3eda4ec780	support ignore patterns in model loader (#6673)	2024-07-22 23:59:42 -07:00
Alexander Matveev	396d92d5e0	[Kernel][Core] Add AWQ support to the Marlin kernel (#6612)	2024-07-21 19:41:42 -04:00
sroy745	14f91fe67c	[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485)	2024-07-20 23:58:58 -07:00
Robert Shaw	683e3cb9c4	[ Misc ] `fbgemm` checkpoints (#6559)	2024-07-20 09:36:57 -07:00
Antoni Baum	7bd82002ae	[Core] Allow specifying custom Executor (#6557)	2024-07-20 01:25:06 +00:00
Nick Hill	b5672a112c	[Core] Multiprocessing Pipeline Parallel support (#6130) Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-18 19:15:52 -07:00
Simon Mo	c5df56f88b	Add support for a rope extension method (#6553)	2024-07-19 01:53:03 +00:00
youkaichao	1c27d25fb5	[core][model] yet another cpu offload implementation (#6496) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-07-17 20:54:35 -07:00
Robert Shaw	18fecc3559	[ Kernel ] Fp8 Channelwise Weight Support (#6487)	2024-07-18 03:18:13 +00:00
Cody Yu	b5af8c223c	[Model] Pipeline parallel support for Mixtral (#6516)	2024-07-17 19:26:04 -07:00
Hongxia Yang	b6c16cf8ff	[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352)	2024-07-11 21:30:46 -07:00

270 Commits