rasmith
|
e5697d161c
|
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)
|
2024-08-28 15:37:47 -04:00 |
|
Patrick von Platen
|
6fc4e6e07a
|
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739)
|
2024-08-27 12:40:02 +00:00 |
|
Megha Agarwal
|
2eedede875
|
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
|
2024-08-26 20:53:20 -07:00 |
|
zifeitong
|
df1a21131d
|
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710)
|
2024-08-22 09:36:24 +08:00 |
|
Nick Hill
|
9b73a2f498
|
[Spec Decoding] Use target model max length as default for draft model (#7706)
|
2024-08-22 00:23:22 +08:00 |
|
Ronen Schaffer
|
2aa00d59ad
|
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer (#7266)
|
2024-08-20 10:02:21 -07:00 |
|
SangBin Cho
|
ff7ec82c4d
|
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109)
|
2024-08-18 17:57:20 -07:00 |
|
Roger Wang
|
bbf55c4805
|
[VLM] Refactor MultiModalConfig initialization and profiling (#7530)
|
2024-08-17 13:30:55 -07:00 |
|
Besher Alkurdi
|
e73f76eec6
|
[Model] Pipeline parallel support for JAIS (#7603)
|
2024-08-17 11:11:09 -07:00 |
|
Mor Zusman
|
7fc23be81c
|
[Kernel] W8A16 Int8 inside FusedMoE (#7415)
|
2024-08-16 10:06:51 -07:00 |
|
Charlie Fu
|
e837b624f2
|
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)
|
2024-08-16 10:06:30 -07:00 |
|
shangmingc
|
b67ae00cdb
|
[Misc] Add quantization config support for speculative model. (#7343)
|
2024-08-15 19:34:28 -07:00 |
|
William Lin
|
2ecf7b1757
|
[core] [3/N] multi-step args and sequence.py (#7452)
|
2024-08-14 12:32:45 -07:00 |
|
Cyrus Leung
|
3f674a49b5
|
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)
|
2024-08-14 17:55:42 +00:00 |
|
youkaichao
|
4d2dc5072b
|
[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102)
|
2024-08-13 00:16:42 -07:00 |
|
jon-chuang
|
a046f86397
|
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-08-12 22:47:41 +00:00 |
|
Cyrus Leung
|
4ddc4743d7
|
[Core] Consolidate GB constant and enable float GB arguments (#7416)
|
2024-08-12 14:14:14 -07:00 |
|
Mahesh Keralapura
|
933790c209
|
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)
|
2024-08-09 13:55:13 -07:00 |
|
Cyrus Leung
|
7eb4a51c5f
|
[Core] Support serving encoder/decoder models (#7258)
|
2024-08-09 10:39:41 +08:00 |
|
Siyuan Liu
|
0fa14907da
|
[TPU] Add Load-time W8A16 quantization for TPU Backend (#7005)
|
2024-08-08 18:35:49 -07:00 |
|
Jee Jee Li
|
a049b107e2
|
[Misc] Temporarily resolve the error of BitAndBytes (#7308)
|
2024-08-08 13:42:58 -07:00 |
|
Cherilyn Buren
|
48abee9e54
|
[Frontend] remove max_num_batched_tokens limit for lora (#7288)
|
2024-08-08 06:17:29 +00:00 |
|
afeldman-nm
|
fd95e026e0
|
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-08-06 16:51:47 -04:00 |
|
Jee Jee Li
|
9118217f58
|
[LoRA] Relax LoRA condition (#7146)
|
2024-08-06 01:57:25 +00:00 |
|
Isotr0py
|
360bd67cf0
|
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-08-05 17:54:23 -06:00 |
|
Cade Daniel
|
82a1b1a82b
|
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963)
|
2024-08-05 08:46:44 +00:00 |
|
Jee Jee Li
|
f80ab3521c
|
Clean up remaining Punica C information (#7027)
|
2024-08-04 15:37:08 -07:00 |
|
Thomas Parnell
|
b1c9aa3daa
|
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator (#7105)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
2024-08-04 07:13:18 -07:00 |
|
Jeff Fialho
|
825b044863
|
[Frontend] Warn if user max_model_len is greater than derived max_model_len (#7080)
Signed-off-by: Jefferson Fialho <jfialho@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-08-03 16:01:38 -07:00 |
|
Murali Andoorveedu
|
fc912e0886
|
[Models] Support Qwen model with PP (#6974)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
|
2024-08-01 12:40:43 -07:00 |
|
xuyi
|
1d2e7fb73f
|
[Model] Pipeline parallel support for Qwen2 (#6924)
|
2024-07-31 18:49:51 -07:00 |
|
Michael Goin
|
a0dce9383a
|
[Misc] Add compressed-tensors to optimized quant list (#7006)
|
2024-07-31 14:40:44 -07:00 |
|
Cyrus Leung
|
da1f7cc12a
|
[mypy] Enable following imports for some directories (#6681)
|
2024-07-31 10:38:03 +08:00 |
|
Michael Goin
|
b1366a9534
|
Add Nemotron to PP_SUPPORTED_MODELS (#6863)
|
2024-07-27 15:05:17 -07:00 |
|
chenqianfzh
|
bb5494676f
|
enforce eager mode with bnb quantization temporarily (#6846)
|
2024-07-27 01:32:20 +00:00 |
|
dongmao zhang
|
87525fab92
|
[bitsandbytes]: support read bnb pre-quantized model (#5753)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-07-23 23:45:09 +00:00 |
|
Travis Johnson
|
507ef787d8
|
[Model] Pipeline Parallel Support for DeepSeek v2 (#6519)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-07-23 12:22:09 -07:00 |
|
Woosuk Kwon
|
a112a84aad
|
[BugFix] Fix RoPE error in Llama 3.1 (#6693)
|
2024-07-23 09:46:05 -07:00 |
|
Woosuk Kwon
|
461089a21a
|
[Bugfix] Fix a log error in chunked prefill (#6694)
|
2024-07-23 09:27:58 -07:00 |
|
Simon Mo
|
3eda4ec780
|
support ignore patterns in model loader (#6673)
|
2024-07-22 23:59:42 -07:00 |
|
Alexander Matveev
|
396d92d5e0
|
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612)
|
2024-07-21 19:41:42 -04:00 |
|
sroy745
|
14f91fe67c
|
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485)
|
2024-07-20 23:58:58 -07:00 |
|
Robert Shaw
|
683e3cb9c4
|
[ Misc ] fbgemm checkpoints (#6559)
|
2024-07-20 09:36:57 -07:00 |
|
Antoni Baum
|
7bd82002ae
|
[Core] Allow specifying custom Executor (#6557)
|
2024-07-20 01:25:06 +00:00 |
|
Nick Hill
|
b5672a112c
|
[Core] Multiprocessing Pipeline Parallel support (#6130)
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
|
2024-07-18 19:15:52 -07:00 |
|
Simon Mo
|
c5df56f88b
|
Add support for a rope extension method (#6553)
|
2024-07-19 01:53:03 +00:00 |
|
youkaichao
|
1c27d25fb5
|
[core][model] yet another cpu offload implementation (#6496)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-07-17 20:54:35 -07:00 |
|
Robert Shaw
|
18fecc3559
|
[ Kernel ] Fp8 Channelwise Weight Support (#6487)
|
2024-07-18 03:18:13 +00:00 |
|
Cody Yu
|
b5af8c223c
|
[Model] Pipeline parallel support for Mixtral (#6516)
|
2024-07-17 19:26:04 -07:00 |
|
Hongxia Yang
|
b6c16cf8ff
|
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352)
|
2024-07-11 21:30:46 -07:00 |
|