Commit Graph

  • e11880deea
    [Bugfix] Remove triton do_bench fast_flush arg (#16256) Kebe 2025-04-08 21:51:06 +08:00
  • 9351f91be9
    [BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm (#16247) TY-AMD 2025-04-08 20:10:26 +08:00
  • 5a1e1c8353
    [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe (#16203) rongfu.leng 2025-04-08 19:05:47 +08:00
  • 69ecaa7c79
    [Misc] Add warning for multimodal data in LLM.beam_search (#16241) Alex Brooks 2025-04-08 05:05:27 -06:00
  • 7f00899ff7
    [Misc] format and refactor some examples (#16252) Reid 2025-04-08 18:42:32 +08:00
  • 995e3d1f41
    [Docs] Add Slides from Singapore Meetup (#16213) Simon Mo 2025-04-08 00:20:22 -07:00
  • b4ac449a83
    [Misc] Merge the logs of pp layers partitions (#16225) Kebe 2025-04-08 15:18:15 +08:00
  • 8e5314a468
    [V1] Add disable_chunked_mm_input arg to disable partial mm input prefill (#15837) Michael Goin 2025-04-08 00:24:07 -06:00
  • 87918e40c4
    [torch.compile][TPU] Make @support_torch_compile work for XLA backend (#15782) Siyuan Liu 2025-04-07 23:23:53 -07:00
  • f6b32efb7f
    [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version (#16194) Isotr0py 2025-04-08 13:38:13 +08:00
  • b99733d092
    [Bugfix] Do not skip "empty" parts of chats that are parsable (#16219) Michael Goin 2025-04-07 23:14:15 -06:00
  • 05a015d6a5
    Add warning for Attention backends that do not support irope yet (#16212) Yong Hoon Shin 2025-04-07 20:59:26 -07:00
  • ad971af8c7
    [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 (#16161) zxfan-cpu 2025-04-08 11:48:47 +08:00
  • f2ebb6f541
    [V1] Scatter and gather placeholders in the model runner (#16076) Roger Wang 2025-04-07 19:43:41 -07:00
  • 1d01211264
    Update BASE_IMAGE to 2.22 release of Neuron (#16218) Satyajith Chilappagari 2025-04-07 19:11:18 -07:00
  • f94ab12f79
    [Misc] Update compressed-tensors to version 0.9.3 (#16196) Miles Williams 2025-04-08 03:09:06 +01:00
  • a865bc1ca6
    [core] do not send error across process (#16174) youkaichao 2025-04-08 10:09:03 +08:00
  • 21802c4b6d
    [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping (#16031) Michael Goin 2025-04-07 19:28:14 -06:00
  • 652907b354
    Torchao (#14231) Driss Guessous 2025-04-07 16:39:28 -07:00
  • 24f1c01e0f
    [Bugfix][V0] XGrammar structured output supports Enum (#15878) leon-seidel 2025-04-08 00:38:25 +02:00
  • fad6e2538e
    [Misc] add description attribute in CLI (#15921) Reid 2025-04-08 06:30:35 +08:00
  • 7f6d47c1a2
    [V1][BugFix] Exit properly if engine core fails during startup (#16137) Nick Hill 2025-04-07 15:30:15 -07:00
  • 3147586ebd
    [Bugfix] Fix guidance backend for Qwen models (#16210) Benjamin Chislett 2025-04-07 18:15:43 -04:00
  • ed636d99ca
    [Misc] Move Llama 4 projector call into encoder execution (#16201) Roger Wang 2025-04-07 14:02:05 -07:00
  • 090c856d76
    [Misc] Human-readable max-model-len cli arg (#16181) Nicolò Lucchesi 2025-04-07 20:40:58 +02:00
  • ad434d4cfe
    Print the warning only once (#16193) Gregory Shtrasberg 2025-04-07 14:30:06 -04:00
  • 66d433b94f
    [V1] Revert the default max_num_seqs to V0 values for most hardware (#16158) Cyrus Leung 2025-04-08 01:54:36 +08:00
  • 027b204ff1
    [Bugfix] Re-enable support for ChatGLMForConditionalGeneration (#16187) Cyrus Leung 2025-04-07 23:15:58 +08:00
  • 55dcce91df
    Upstream Llama4 Support to Main (#16113) Lu Fang 2025-04-07 08:06:27 -07:00
  • 8017c8db7f
    [Doc]Update image to latest version (#16186) Robin 2025-04-07 22:17:39 +08:00
  • dc3529dbf6
    [Misc] improve example mlpspeculator and llm_engine_example (#16175) Reid 2025-04-07 19:53:52 +08:00
  • 7699258ef0
    [Model] Add Qwen3 and Qwen3MoE (#15289) YamPengLi 2025-04-07 19:06:41 +08:00
  • e9ba99f296
    [V1][Structured Output] Add supports_structured_output() method to Platform (#16148) Shanshan Shen 2025-04-07 19:06:24 +08:00
  • 7c80368710
    [VLM] Florence-2 supports online serving (#16164) Isotr0py 2025-04-07 19:04:02 +08:00
  • 95d63f38c0
    doc: fix some typos in doc (#16154) yihong 2025-04-07 13:32:06 +08:00
  • bb8dab821e
    [CI] Set max transformers version for Ultravox model test (#16149) Roger Wang 2025-04-06 21:37:58 -07:00
  • fc0f87768a
    [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings (#16129) Isotr0py 2025-04-07 12:07:15 +08:00
  • 0a57386721
    [Misc] Update Mistral-3.1 example (#16147) Cyrus Leung 2025-04-07 11:57:37 +08:00
  • 3749e28774
    [V1][Minor] Minor simplification for get_computed_blocks (#16139) Woosuk Kwon 2025-04-06 20:38:12 -07:00
  • 86fc2321ff
    [Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token (#15202) Kay Yan 2025-04-07 11:34:51 +08:00
  • 2549c0dfef
    Fix requires-python (#16132) Martin Hoyer 2025-04-07 04:22:25 +02:00
  • b10e519895
    [V1][Minor] Optimize get_cached_block (#16135) Woosuk Kwon 2025-04-06 13:48:14 -07:00
  • 9bde5ba127
    [TPU] Update PyTorch/XLA (#16130) Chengji Yao 2025-04-06 11:25:55 -07:00
  • 72c8f1ad04
    [Misc] update requires-python in pyproject.toml (#16116) Reid 2025-04-06 22:56:34 +08:00
  • da224daaa9
    [Bugfix] add hf_token to EngineArgs (#16093) paolovic 2025-04-06 16:47:33 +02:00
  • 3a100b9278
    [Bugfix] LoRA : Fix the order in which the kernels process LoRAs (#16040) Varun Sundar Rabindranath 2025-04-06 10:04:50 -04:00
  • 242a637aea
    [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 (#16103) rongfu.leng 2025-04-06 20:52:01 +08:00
  • c2a9671510
    [Misc] Improve model redirect to accept json dictionary (#16119) Isotr0py 2025-04-06 20:51:45 +08:00
  • d5ae4f7f42
    [Doc][Bugfix] Add missing EOF in k8s deploy doc (#16025) Paul Schweigert 2025-04-06 08:10:57 -04:00
  • b6c502a150
    [Misc] refactor example eagle (#16100) Reid 2025-04-06 17:42:48 +08:00
  • 9ca710e525
    [CI][V1] Fix passing tokenizer as kwarg to validate_guidance_grammar (#16117) Roger Wang 2025-04-06 01:18:00 -07:00
  • eb07c8cb5b
    [Frontend] Fix typo in tool chat templates for llama3.2 and toolace (#14501) Ben Jackson 2025-04-06 00:44:36 -07:00
  • ba10801961
    [Benchmark] Add sampling parameters to benchmark_serving. (#16022) Hyesoo Yang 2025-04-05 21:30:35 -07:00
  • 620fc2d09e
    [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 (#16112) Lucia Fang 2025-04-05 21:23:40 -07:00
  • 29283eaa7e
    [Model] use AutoWeightsLoader for phi, gemma, deepseek (#16088) Jonghyun Choe 2025-04-06 12:34:38 +09:00
  • 2fa66ef713
    [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine (#15946) Jinzhen Lin 2025-04-06 11:04:22 +08:00
  • 13affc432d
    [Misc] Remove redundant code (#16098) Chauncey 2025-04-06 11:03:50 +08:00
  • d8f094a92a
    [Misc] format output for encoder_decoder.py (#16095) Reid 2025-04-06 10:57:18 +08:00
  • 97ae6d777f
    Fix some capitalisations in generated examples doc titles (#16094) Harry Mellor 2025-04-05 14:44:03 +01:00
  • 6baeee70d1
    Revert "doc: add info for macos clang errors (#16049)" (#16091) yihong 2025-04-05 19:51:51 +08:00
  • d2517a4939
    [doc] fix 404 (#16082) Reid 2025-04-05 19:39:18 +08:00
  • 6342adc438
    fix: support clang17 for macos and fix the real libomp (#16086) yihong 2025-04-05 19:00:12 +08:00
  • 0adba91547
    [CI] Fix benchmark script level (#16089) Kevin H. Luu 2025-04-05 03:36:01 -07:00
  • 4285e423a6
    [Misc] Auto detect bitsandbytes pre-quantized models (#16027) Tristan Leclercq 2025-04-05 08:30:45 +02:00
  • 63375f0cdb
    [V1][Spec Decode] Update N-gram Proposer Interface (#15750) Woosuk Kwon 2025-04-04 16:32:54 -07:00
  • 70ad3f9e98
    [Bugfix][TPU] Fix V1 TPU worker for sliding window (#16059) Michael Goin 2025-04-04 17:31:19 -06:00
  • d6fc629f4d
    [Kernel][Minor] Re-fuse triton moe weight application (#16071) bnellnm 2025-04-04 19:27:34 -04:00
  • af51d80fa1
    Revert "[V1] Scatter and gather placeholders in the model runner" (#16075) Roger Wang 2025-04-04 14:50:57 -07:00
  • f5722a5052
    [V1] Scatter and gather placeholders in the model runner (#15712) Cyrus Leung 2025-04-05 05:26:44 +08:00
  • 651cf0fec1
    [V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue (#15906) Nick Hill 2025-04-04 12:56:43 -07:00
  • 4dc52e1c53
    [CI] Reorganize .buildkite directory (#16001) Kevin H. Luu 2025-04-04 12:16:20 -07:00
  • 4708f13a9c
    [Bugfix] Fix default behavior/fallback for pp in v1 (#16057) Michael Goin 2025-04-04 11:58:08 -06:00
  • a6d042df0a
    [ROCm][Bugfix] Bring back fallback to eager mode removed in #14917, but for ROCm only (#15413) Gregory Shtrasberg 2025-04-04 12:40:37 -04:00
  • 40a36ccfeb
    [ROCm][Bugfix] Use platform specific FP8 dtype (#15717) Gregory Shtrasberg 2025-04-04 12:40:20 -04:00
  • ef608c37a7
    [Distributed] [ROCM] Fix custom allreduce enable checks (#16010) Ilya Markov 2025-04-04 18:39:08 +02:00
  • 2386803f2a
    [CPU] Change default block_size for CPU backend (#16002) Li, Jiang 2025-04-05 00:39:05 +08:00
  • 95862f7b4d
    [Benchmark][Doc] Update throughput benchmark and README (#15998) Ziji Shi (Steven) 2025-04-04 09:39:02 -07:00
  • 230b131b54
    [Bugfix][kernels] Fix half2float conversion in gguf kernels (#15995) Isotr0py 2025-04-05 00:38:58 +08:00
  • 0812d8dd41
    [Hardware][Gaudi][BugFix] fix arguments of hpu fused moe (#15945) liuzhenwei 2025-04-05 00:38:55 +08:00
  • bf7e3c51ae
    [Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt (#15939) Jonghyun Choe 2025-04-05 01:38:52 +09:00
  • a35a8a8392
    [V1][Spec Decode] Avoid logging useless nan metrics (#16023) Mark McLoughlin 2025-04-04 16:52:41 +01:00
  • 4ef0bb1fcf
    doc: add info for macos clang errors (#16049) yihong 2025-04-04 22:58:16 +08:00
  • fadc59c0e6
    [TPU][V1] Remove ragged attention kernel parameter hard coding (#16041) Chengji Yao 2025-04-04 04:48:50 -07:00
  • 86cbd2eee9
    [Misc] improve gguf check (#15974) Reid 2025-04-04 09:33:36 +08:00
  • 092475f738
    [ROCm] Tweak the benchmark script to run on ROCm (#14252) Huy Do 2025-04-03 17:12:48 -07:00
  • dcc56d62da
    [Bugfix] Fix function names in test_block_fp8.py (#16033) bnellnm 2025-04-03 19:01:34 -04:00
  • f15e70d906
    [TPU] Switch Test to Non-Sliding Window (#15981) Robert Shaw 2025-04-03 14:28:45 -07:00
  • b6be6f8d1e
    [TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. (#15732) iefgnoix 2025-04-03 14:23:28 -07:00
  • 03a70eacaf
    Re-enable the AMD Testing for the passing tests. (#15586) Alexei-V-Ivanov-AMD 2025-04-03 13:05:17 -05:00
  • 45b1ff7a25
    [Misc][Performance] Advance tpu.txt to the most recent nightly torch … (#16024) yarongmu-google 2025-04-03 10:32:54 -07:00
  • 15ba07ef25
    [Minor] Fused experts refactor (#15914) bnellnm 2025-04-03 13:19:38 -04:00
  • d2b58ca203
    [Neuron][kernel] Fuse kv cache into a single tensor (#15911) Liangfu Chen 2025-04-03 09:51:32 -07:00
  • 82e7e19a6e
    [SupportsQuant] Chameleon, Chatglm, Commandr (#15952) Kyle Sayers 2025-04-03 11:25:22 -04:00
  • 421c462948
    [SupportsQuant] Bert, Blip, Blip2, Bloom (#15573) Kyle Sayers 2025-04-03 11:23:19 -04:00
  • 84884cd9ac
    fix: tiny fix make format.sh excutable (#16015) yihong 2025-04-03 23:18:05 +08:00
  • a43aa183dc
    [doc] update contribution link (#15922) Reid 2025-04-03 18:47:31 +08:00
  • 463bbb1835
    [Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process (#15367) wwl2755 2025-04-03 02:32:10 -05:00
  • 5e125e74d1
    [misc] improve error message for "Failed to infer device type" (#15994) youkaichao 2025-04-03 14:45:03 +08:00
  • 06f21ce7a5
    [Benchmark] Add AIMO Dataset to Benchmark (#15955) Ziji Shi (Steven) 2025-04-02 23:09:18 -07:00
  • 57a810db9c
    [ROCM][V0] PA kennel selection when no sliding window provided (#15982) Aleksandr Malyshev 2025-04-02 22:28:44 -07:00