-
e11880deea
[Bugfix] Remove triton do_bench fast_flush arg (#16256)
Kebe
2025-04-08 21:51:06 +08:00
-
9351f91be9
[BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm (#16247)
TY-AMD
2025-04-08 20:10:26 +08:00
-
5a1e1c8353
[Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe (#16203)
rongfu.leng
2025-04-08 19:05:47 +08:00
-
69ecaa7c79
[Misc] Add warning for multimodal data in LLM.beam_search (#16241)
Alex Brooks
2025-04-08 05:05:27 -06:00
-
7f00899ff7
[Misc] format and refactor some examples (#16252)
Reid
2025-04-08 18:42:32 +08:00
-
995e3d1f41
[Docs] Add Slides from Singapore Meetup (#16213)
Simon Mo
2025-04-08 00:20:22 -07:00
-
b4ac449a83
[Misc] Merge the logs of pp layers partitions (#16225)
Kebe
2025-04-08 15:18:15 +08:00
-
8e5314a468
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill (#15837)
Michael Goin
2025-04-08 00:24:07 -06:00
-
87918e40c4
[torch.compile][TPU] Make @support_torch_compile work for XLA backend (#15782)
Siyuan Liu
2025-04-07 23:23:53 -07:00
-
f6b32efb7f
[Bugfix] Fix and reorganize broken GGUF tests and bump gguf version (#16194)
Isotr0py
2025-04-08 13:38:13 +08:00
-
b99733d092
[Bugfix] Do not skip "empty" parts of chats that are parsable (#16219)
Michael Goin
2025-04-07 23:14:15 -06:00
-
05a015d6a5
Add warning for Attention backends that do not support irope yet (#16212)
Yong Hoon Shin
2025-04-07 20:59:26 -07:00
-
ad971af8c7
[Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 (#16161)
zxfan-cpu
2025-04-08 11:48:47 +08:00
-
f2ebb6f541
[V1] Scatter and gather placeholders in the model runner (#16076)
Roger Wang
2025-04-07 19:43:41 -07:00
-
1d01211264
Update BASE_IMAGE to 2.22 release of Neuron (#16218)
Satyajith Chilappagari
2025-04-07 19:11:18 -07:00
-
f94ab12f79
[Misc] Update compressed-tensors to version 0.9.3 (#16196)
Miles Williams
2025-04-08 03:09:06 +01:00
-
a865bc1ca6
[core] do not send error across process (#16174)
youkaichao
2025-04-08 10:09:03 +08:00
-
21802c4b6d
[ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping (#16031)
Michael Goin
2025-04-07 19:28:14 -06:00
-
652907b354
Torchao (#14231)
Driss Guessous
2025-04-07 16:39:28 -07:00
-
24f1c01e0f
[Bugfix][V0] XGrammar structured output supports Enum (#15878)
leon-seidel
2025-04-08 00:38:25 +02:00
-
fad6e2538e
[Misc] add description attribute in CLI (#15921)
Reid
2025-04-08 06:30:35 +08:00
-
7f6d47c1a2
[V1][BugFix] Exit properly if engine core fails during startup (#16137)
Nick Hill
2025-04-07 15:30:15 -07:00
-
3147586ebd
[Bugfix] Fix guidance backend for Qwen models (#16210)
Benjamin Chislett
2025-04-07 18:15:43 -04:00
-
ed636d99ca
[Misc] Move Llama 4 projector call into encoder execution (#16201)
Roger Wang
2025-04-07 14:02:05 -07:00
-
090c856d76
[Misc] Human-readable max-model-len cli arg (#16181)
Nicolò Lucchesi
2025-04-07 20:40:58 +02:00
-
ad434d4cfe
Print the warning only once (#16193)
Gregory Shtrasberg
2025-04-07 14:30:06 -04:00
-
66d433b94f
[V1] Revert the default max_num_seqs to V0 values for most hardware (#16158)
Cyrus Leung
2025-04-08 01:54:36 +08:00
-
027b204ff1
[Bugfix] Re-enable support for ChatGLMForConditionalGeneration (#16187)
Cyrus Leung
2025-04-07 23:15:58 +08:00
-
55dcce91df
Upstream Llama4 Support to Main (#16113)
Lu Fang
2025-04-07 08:06:27 -07:00
-
8017c8db7f
[Doc]Update image to latest version (#16186)
Robin
2025-04-07 22:17:39 +08:00
-
dc3529dbf6
[Misc] improve example mlpspeculator and llm_engine_example (#16175)
Reid
2025-04-07 19:53:52 +08:00
-
7699258ef0
[Model] Add Qwen3 and Qwen3MoE (#15289)
YamPengLi
2025-04-07 19:06:41 +08:00
-
e9ba99f296
[V1][Structured Output] Add supports_structured_output() method to Platform (#16148)
Shanshan Shen
2025-04-07 19:06:24 +08:00
-
7c80368710
[VLM] Florence-2 supports online serving (#16164)
Isotr0py
2025-04-07 19:04:02 +08:00
-
95d63f38c0
doc: fix some typos in doc (#16154)
yihong
2025-04-07 13:32:06 +08:00
-
bb8dab821e
[CI] Set max transformers version for Ultravox model test (#16149)
Roger Wang
2025-04-06 21:37:58 -07:00
-
fc0f87768a
[Bugfix] Make dummy encoder prompt padding alternative and add missing warnings (#16129)
Isotr0py
2025-04-07 12:07:15 +08:00
-
0a57386721
[Misc] Update Mistral-3.1 example (#16147)
Cyrus Leung
2025-04-07 11:57:37 +08:00
-
3749e28774
[V1][Minor] Minor simplification for get_computed_blocks (#16139)
Woosuk Kwon
2025-04-06 20:38:12 -07:00
-
86fc2321ff
[Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token (#15202)
Kay Yan
2025-04-07 11:34:51 +08:00
-
2549c0dfef
Fix requires-python (#16132)
Martin Hoyer
2025-04-07 04:22:25 +02:00
-
b10e519895
[V1][Minor] Optimize get_cached_block (#16135)
Woosuk Kwon
2025-04-06 13:48:14 -07:00
-
9bde5ba127
[TPU] Update PyTorch/XLA (#16130)
Chengji Yao
2025-04-06 11:25:55 -07:00
-
72c8f1ad04
[Misc] update requires-python in pyproject.toml (#16116)
Reid
2025-04-06 22:56:34 +08:00
-
da224daaa9
[Bugfix] add hf_token to EngineArgs (#16093)
paolovic
2025-04-06 16:47:33 +02:00
-
3a100b9278
[Bugfix] LoRA : Fix the order in which the kernels process LoRAs (#16040)
Varun Sundar Rabindranath
2025-04-06 10:04:50 -04:00
-
242a637aea
[Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 (#16103)
rongfu.leng
2025-04-06 20:52:01 +08:00
-
c2a9671510
[Misc] Improve model redirect to accept json dictionary (#16119)
Isotr0py
2025-04-06 20:51:45 +08:00
-
d5ae4f7f42
[Doc][Bugfix] Add missing EOF in k8s deploy doc (#16025)
Paul Schweigert
2025-04-06 08:10:57 -04:00
-
b6c502a150
[Misc] refactor example eagle (#16100)
Reid
2025-04-06 17:42:48 +08:00
-
9ca710e525
[CI][V1] Fix passing tokenizer as kwarg to validate_guidance_grammar (#16117)
Roger Wang
2025-04-06 01:18:00 -07:00
-
eb07c8cb5b
[Frontend] Fix typo in tool chat templates for llama3.2 and toolace (#14501)
Ben Jackson
2025-04-06 00:44:36 -07:00
-
ba10801961
[Benchmark] Add sampling parameters to benchmark_serving. (#16022)
Hyesoo Yang
2025-04-05 21:30:35 -07:00
-
620fc2d09e
[Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 (#16112)
Lucia Fang
2025-04-05 21:23:40 -07:00
-
29283eaa7e
[Model] use AutoWeightsLoader for phi, gemma, deepseek (#16088)
Jonghyun Choe
2025-04-06 12:34:38 +09:00
-
2fa66ef713
[Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine (#15946)
Jinzhen Lin
2025-04-06 11:04:22 +08:00
-
13affc432d
[Misc] Remove redundant code (#16098)
Chauncey
2025-04-06 11:03:50 +08:00
-
d8f094a92a
[Misc] format output for encoder_decoder.py (#16095)
Reid
2025-04-06 10:57:18 +08:00
-
97ae6d777f
Fix some capitalisations in generated examples doc titles (#16094)
Harry Mellor
2025-04-05 14:44:03 +01:00
-
6baeee70d1
Revert "doc: add info for macos clang errors (#16049)" (#16091)
yihong
2025-04-05 19:51:51 +08:00
-
d2517a4939
[doc] fix 404 (#16082)
Reid
2025-04-05 19:39:18 +08:00
-
6342adc438
fix: support clang17 for macos and fix the real libomp (#16086)
yihong
2025-04-05 19:00:12 +08:00
-
0adba91547
[CI] Fix benchmark script level (#16089)
Kevin H. Luu
2025-04-05 03:36:01 -07:00
-
4285e423a6
[Misc] Auto detect bitsandbytes pre-quantized models (#16027)
Tristan Leclercq
2025-04-05 08:30:45 +02:00
-
63375f0cdb
[V1][Spec Decode] Update N-gram Proposer Interface (#15750)
Woosuk Kwon
2025-04-04 16:32:54 -07:00
-
70ad3f9e98
[Bugfix][TPU] Fix V1 TPU worker for sliding window (#16059)
Michael Goin
2025-04-04 17:31:19 -06:00
-
d6fc629f4d
[Kernel][Minor] Re-fuse triton moe weight application (#16071)
bnellnm
2025-04-04 19:27:34 -04:00
-
af51d80fa1
Revert "[V1] Scatter and gather placeholders in the model runner" (#16075)
Roger Wang
2025-04-04 14:50:57 -07:00
-
f5722a5052
[V1] Scatter and gather placeholders in the model runner (#15712)
Cyrus Leung
2025-04-05 05:26:44 +08:00
-
651cf0fec1
[V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue (#15906)
Nick Hill
2025-04-04 12:56:43 -07:00
-
4dc52e1c53
[CI] Reorganize .buildkite directory (#16001)
Kevin H. Luu
2025-04-04 12:16:20 -07:00
-
4708f13a9c
[Bugfix] Fix default behavior/fallback for pp in v1 (#16057)
Michael Goin
2025-04-04 11:58:08 -06:00
-
a6d042df0a
[ROCm][Bugfix] Bring back fallback to eager mode removed in #14917, but for ROCm only (#15413)
Gregory Shtrasberg
2025-04-04 12:40:37 -04:00
-
40a36ccfeb
[ROCm][Bugfix] Use platform specific FP8 dtype (#15717)
Gregory Shtrasberg
2025-04-04 12:40:20 -04:00
-
ef608c37a7
[Distributed] [ROCM] Fix custom allreduce enable checks (#16010)
Ilya Markov
2025-04-04 18:39:08 +02:00
-
2386803f2a
[CPU] Change default block_size for CPU backend (#16002)
Li, Jiang
2025-04-05 00:39:05 +08:00
-
95862f7b4d
[Benchmark][Doc] Update throughput benchmark and README (#15998)
Ziji Shi (Steven)
2025-04-04 09:39:02 -07:00
-
230b131b54
[Bugfix][kernels] Fix half2float conversion in gguf kernels (#15995)
Isotr0py
2025-04-05 00:38:58 +08:00
-
0812d8dd41
[Hardware][Gaudi][BugFix] fix arguments of hpu fused moe (#15945)
liuzhenwei
2025-04-05 00:38:55 +08:00
-
bf7e3c51ae
[Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt (#15939)
Jonghyun Choe
2025-04-05 01:38:52 +09:00
-
a35a8a8392
[V1][Spec Decode] Avoid logging useless nan metrics (#16023)
Mark McLoughlin
2025-04-04 16:52:41 +01:00
-
4ef0bb1fcf
doc: add info for macos clang errors (#16049)
yihong
2025-04-04 22:58:16 +08:00
-
fadc59c0e6
[TPU][V1] Remove ragged attention kernel parameter hard coding (#16041)
Chengji Yao
2025-04-04 04:48:50 -07:00
-
86cbd2eee9
[Misc] improve gguf check (#15974)
Reid
2025-04-04 09:33:36 +08:00
-
092475f738
[ROCm] Tweak the benchmark script to run on ROCm (#14252)
Huy Do
2025-04-03 17:12:48 -07:00
-
dcc56d62da
[Bugfix] Fix function names in test_block_fp8.py (#16033)
bnellnm
2025-04-03 19:01:34 -04:00
-
f15e70d906
[TPU] Switch Test to Non-Sliding Window (#15981)
Robert Shaw
2025-04-03 14:28:45 -07:00
-
b6be6f8d1e
[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. (#15732)
iefgnoix
2025-04-03 14:23:28 -07:00
-
03a70eacaf
Re-enable the AMD Testing for the passing tests. (#15586)
Alexei-V-Ivanov-AMD
2025-04-03 13:05:17 -05:00
-
45b1ff7a25
[Misc][Performance] Advance tpu.txt to the most recent nightly torch … (#16024)
yarongmu-google
2025-04-03 10:32:54 -07:00
-
15ba07ef25
[Minor] Fused experts refactor (#15914)
bnellnm
2025-04-03 13:19:38 -04:00
-
d2b58ca203
[Neuron][kernel] Fuse kv cache into a single tensor (#15911)
Liangfu Chen
2025-04-03 09:51:32 -07:00
-
82e7e19a6e
[SupportsQuant] Chameleon, Chatglm, Commandr (#15952)
Kyle Sayers
2025-04-03 11:25:22 -04:00
-
421c462948
[SupportsQuant] Bert, Blip, Blip2, Bloom (#15573)
Kyle Sayers
2025-04-03 11:23:19 -04:00
-
84884cd9ac
fix: tiny fix make format.sh excutable (#16015)
yihong
2025-04-03 23:18:05 +08:00
-
a43aa183dc
[doc] update contribution link (#15922)
Reid
2025-04-03 18:47:31 +08:00
-
463bbb1835
[Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process (#15367)
wwl2755
2025-04-03 02:32:10 -05:00
-
5e125e74d1
[misc] improve error message for "Failed to infer device type" (#15994)
youkaichao
2025-04-03 14:45:03 +08:00
-
06f21ce7a5
[Benchmark] Add AIMO Dataset to Benchmark (#15955)
Ziji Shi (Steven)
2025-04-02 23:09:18 -07:00
-
57a810db9c
[ROCM][V0] PA kennel selection when no sliding window provided (#15982)
Aleksandr Malyshev
2025-04-02 22:28:44 -07:00