Commit Graph

  • 26507f8973
    [Docs] Fix a link and grammar issue in production-stack.md (#16809) main Michael Yao 2025-04-18 14:42:58 +08:00
  • 9c1d5b456d
    [Doc] add podman setup instructions for official image (#16796) Nathan Weinberg 2025-04-18 02:10:49 -04:00
  • e31045f95c
    [Bugfix] fix pp for llama4 (#16746) Lucia Fang 2025-04-17 22:51:30 -07:00
  • aaec845f8e
    [ROCm] [Attention] Cleanup ROCm output passing (#16431) Luka Govedič 2025-04-18 01:46:45 -04:00
  • 7bdfd29a35
    [Misc] add collect_env to cli and docker image (#16759) rongfu.leng 2025-04-18 13:13:35 +08:00
  • e78587a64c
    Improve-mm-and-pooler-and-decoding-configs (#16789) Harry Mellor 2025-04-18 06:13:32 +01:00
  • 7eb4255628
    [BugFix] Accuracy fix for llama4 int4 - improperly casted scales (#16801) Lucas Wilkinson 2025-04-18 01:13:29 -04:00
  • 6a0f547561
    Add hardware print to TPU V1 test (#16792) Michael Goin 2025-04-17 23:13:26 -06:00
  • 30ed81b7ca
    [V1][Structured Output] Minor modification to _validate_structured_output() (#16748) Shanshan Shen 2025-04-18 13:12:54 +08:00
  • 7a4a5de729
    [Misc] Update outdated note: LMCache now supports chunked prefill (#16697) Chauncey 2025-04-18 13:12:42 +08:00
  • c16fb5dae8
    [Doc] Improve help examples for --compilation-config (#16729) Cyrus Leung 2025-04-18 12:22:34 +08:00
  • e37073efd7
    Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721) Tarun Kumar 2025-04-18 09:38:27 +05:30
  • 183dad7a85
    [Attention] Update to lastest FA3 code (#13111) Lucas Wilkinson 2025-04-17 18:14:07 -04:00
  • 3408e47159
    [P/D][V1] KV Connector API V1 (#15960) Yihua Cheng 2025-04-17 15:22:40 -05:00
  • 0377b8310b
    [MLA] Simplification to batch P/D reordering (#16673) Nick Hill 2025-04-17 13:12:09 -07:00
  • e4755f7fac
    [V1][Metrics] Fix http metrics middleware (#15894) Mark McLoughlin 2025-04-17 20:52:18 +01:00
  • 92edf35826
    [ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints (#16674) Sijia(Jackson) Chen 2025-04-17 11:44:34 -07:00
  • eb5819b2d9
    [V1][TPU] Enable Top K (#15489) Nicolò Lucchesi 2025-04-17 20:18:11 +02:00
  • 5989f4684d
    [TPU][V1] Fix padding recompilation when max-num-batched-tokens is not even (#16726) Nicolò Lucchesi 2025-04-17 20:09:57 +02:00
  • 5125d72f02
    [Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small (#16548) rongfu.leng 2025-04-18 01:48:31 +08:00
  • a018e555fd
    [Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753) Ximingwang-09 2025-04-18 00:01:30 +08:00
  • 6211b92273
    [Bugfix]Fix index out of range error in api server log (#16787) Robin 2025-04-18 00:01:07 +08:00
  • 05fcd1b430
    [V1][Perf] Faster incremental detokenization (#15137) Nick Hill 2025-04-17 07:45:24 -07:00
  • 7c02d6a137
    [Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion (#16784) Insu Kim 2025-04-17 23:10:08 +09:00
  • 11c3b98491
    [Doc] Document Matryoshka Representation Learning support (#16770) wang.yuqi 2025-04-17 21:37:37 +08:00
  • dbe7f07001
    [Doc] Make sure to update vLLM when installing latest code (#16781) Cyrus Leung 2025-04-17 20:53:31 +08:00
  • c69bf4ee06
    fix: hyperlink (#16778) Reid 2025-04-17 19:34:20 +08:00
  • d27ea94034
    Improve configs - TokenizerPoolConfig + DeviceConfig (#16603) Harry Mellor 2025-04-17 12:19:42 +01:00
  • 99ed526101
    [Misc] refactor examples series - lmcache (#16758) Reid 2025-04-17 19:02:35 +08:00
  • 207da28186
    [Doc] Fix a 404 link in installation/cpu.md (#16773) Michael Yao 2025-04-17 18:46:21 +08:00
  • 5b1aca2ae3
    [Bugfix] Fix GLM4 model (#16618) intervitens 2025-04-17 13:35:07 +03:00
  • d8e557b5e5
    [doc] add open-webui example (#16747) Reid 2025-04-17 18:27:32 +08:00
  • 61a44a0b22
    [Doc] Add more tips to avoid OOM (#16765) Cyrus Leung 2025-04-17 17:54:34 +08:00
  • a6481525b8
    [misc] ignore marlin_moe_wna16 local gen codes (#16760) DefTruth 2025-04-17 17:15:14 +08:00
  • 8cac35ba43
    [Ray] Improve documentation on batch inference (#16609) Richard Liaw 2025-04-16 22:19:26 -07:00
  • 9dbf7a2dc1
    [V1] Remove log noise when idle (#16735) Russell Bryant 2025-04-17 00:34:08 -04:00
  • 607029e515
    [Bugfix] Revert max_prompt_len validation for decoder-only models. (#16741) David Heineman 2025-04-16 21:33:15 -07:00
  • cb072ce93b
    [Bugfix] Update Florence-2 tokenizer to make grounding tasks work (#16734) Isotr0py 2025-04-17 12:17:39 +08:00
  • 95aca283b4
    [rocm][V0] fix selection logic for custom PA in V0 (#16426) Divakar Verma 2025-04-16 21:52:11 -05:00
  • 2b05b8ce69
    [V1][Frontend] Improve Shutdown And Logs (#11737) Robert Shaw 2025-04-16 22:48:34 -04:00
  • 3c776dcefb
    Adding vllm buildkite job for IBM Power (#16679) Aaruni Aggarwal 2025-04-17 08:17:47 +05:30
  • 2cbd4d2999
    [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification (#16636) Bryan Lu 2025-04-16 19:47:26 -07:00
  • 3092375e27
    [V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] (#16432) Staszek Paśko 2025-04-17 04:28:32 +02:00
  • 3cd91dc955
    Help user create custom model for Transformers backend remote code models (#16719) Harry Mellor 2025-04-17 02:05:59 +01:00
  • 8a7368e069
    [Misc] Remove redundant comment (#16703) Jade Zheng 2025-04-17 08:44:52 +08:00
  • 93e561ec4d
    Improve error for structured output backend selection (#16717) Harry Mellor 2025-04-17 01:35:35 +01:00
  • e1b004839a
    [Hardware] Add processor inputs to platform validation (#16680) Joe Runde 2025-04-16 18:28:42 +02:00
  • ee378f3d49
    [Model] support modernbert (#16648) xsank 2025-04-16 20:30:15 +08:00
  • e82ee40de3
    [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel (#16693) DefTruth 2025-04-16 18:31:39 +08:00
  • facbe2a114
    [Doc] Improve OOM troubleshooting (#16704) Cyrus Leung 2025-04-16 18:29:48 +08:00
  • 7168920491
    [Misc] refactor examples series (#16708) Reid 2025-04-16 18:16:36 +08:00
  • 21378a2323
    [CI] Cleanup additional_dependencies: [toml] for pre-commit yapf hook (#16405) Kay Yan 2025-04-16 18:05:31 +08:00
  • 976711d9db
    [V1][Structured Output] Move xgrammar related utils to backend_xgrammar.py (#16578) Shanshan Shen 2025-04-16 17:01:36 +08:00
  • 44fa4d556c
    [ROCM] Bind triton version to 3.2 in requirements-built.txt (#16664) Sage Moore 2025-04-15 23:05:28 -07:00
  • 3ac98edcb1
    [Feature] add model aware kv ops helper (#16020) billishyahao 2025-04-16 14:00:43 +08:00
  • 966c742ed2
    Disable remote caching when calling compile_fx (#16611) Richard Zou 2025-04-16 01:18:28 -04:00
  • 0d7d05f4b6
    [Misc] Modify LRUCache touch (#16689) Jee Jee Li 2025-04-16 12:51:38 +08:00
  • 96bb8aa68b
    [Bugfix] fix gpu docker image mis benchmarks dir (#16628) rongfu.leng 2025-04-16 12:21:14 +08:00
  • 3badb0213b
    [Model] Add PLaMo2 (#14323) Shinichi Hemmi 2025-04-16 11:31:30 +09:00
  • fdcb850f14
    [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546) Angky William 2025-04-15 15:31:38 -07:00
  • 54a66e5fee
    [Misc] Update compressed-tensors WNA16 to support zero-points (#14211) Dipika Sikka 2025-04-15 09:33:51 -04:00
  • 280d62b8a2
    [Kernel] Remove redundant Exp calculations (#16123) DefTruth 2025-04-15 20:58:37 +08:00
  • 1666e66443
    Add "/server_info" endpoint in api_server to retrieve the vllm_config. (#16572) Xihui Cang 2025-04-15 19:50:38 +08:00
  • 1575c1701a
    [CI/Build] Fix LoRA OOM (#16624) Jee Jee Li 2025-04-15 16:38:19 +08:00
  • 6ae996a873
    [Misc] refactor argument parsing in examples (#16635) Reid 2025-04-15 16:05:30 +08:00
  • b590adfdc1
    Fix vLLM x torch.compile config caching (#16491) Richard Zou 2025-04-15 02:11:11 -04:00
  • b4fe16c75b
    Add vllm bench [latency, throughput] CLI commands (#16508) Michael Goin 2025-04-15 00:10:35 -06:00
  • bc5dd4f669
    [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) (#16631) Pooya Davoodi 2025-04-14 23:09:58 -07:00
  • dbb036cf61
    [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py (#16623) Tyler Michael Smith 2025-04-15 01:35:38 -04:00
  • 70e7ed841d
    [BugFix]: Update minimum pyzmq version (#16549) Taneem Ibrahim 2025-04-14 22:06:03 -05:00
  • d06ba4ed3f
    [Kernel] moe wna16 marlin kernel (#14447) Jinzhen Lin 2025-04-15 11:05:22 +08:00
  • 6b40996ae8
    [Core][Bugfix] Fix Offline MM Beam Search (#16390) Alex Brooks 2025-04-14 20:33:02 -06:00
  • d2020acac7
    config check sleep mode support oot platforms (#16562) Shuqiao Li 2025-04-15 07:31:50 +08:00
  • 1eb3c2ed48
    [DOC][TPU] Add core idea about avoiding recompilation after warmup (#16614) Chengji Yao 2025-04-14 14:56:06 -07:00
  • c64ee87267
    [Hardware][TPU] Add torchvision to tpu dependency file (#16616) Siyuan Liu 2025-04-14 14:50:46 -07:00
  • b1308b84a3
    [Model][VLM] Add Kimi-VL model support (#16387) courage17340 2025-04-15 05:41:48 +08:00
  • 7b5ecf79bd
    s390x: Fix PyArrow build and add CPU test script for Buildkite CI (#16036) Nishan Acharya 2025-04-14 23:25:32 +05:30
  • 9883a18859
    Fix triton install condition on CPU (#16600) Harry Mellor 2025-04-14 18:06:01 +01:00
  • b3f2fddd17
    [TPU][V1] Fix exponential padding when max-num-batched-tokens is not a power of 2 (#16596) Nicolò Lucchesi 2025-04-14 19:01:05 +02:00
  • aa29841ede
    [Bugfix] Multi-modal caches not acting like LRU caches (#16593) Cyrus Leung 2025-04-15 00:24:16 +08:00
  • 6bf27affb6
    [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet (#16048) Md. Shafi Hussain 2025-04-14 21:38:39 +05:30
  • 1dd23386ec
    [Misc] Update usage with mooncake lib for kv transfer (#16523) shangmingc 2025-04-14 19:31:37 +08:00
  • 7cbfc10943
    [Misc] refactor examples (#16563) Reid 2025-04-14 17:59:15 +08:00
  • ce4ddd2d1a
    [Misc] remove warning if triton>=3.2.0 (#16553) DefTruth 2025-04-14 17:39:47 +08:00
  • e51929ebca
    Improve configs - SchedulerConfig (#16533) Harry Mellor 2025-04-14 10:24:16 +01:00
  • dc1b4a6f13
    [Core][V0] Enable regex support with xgrammar (#13228) Russell Bryant 2025-04-13 22:13:38 -04:00
  • 63d2705edb
    [Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py (#16556) Jennifer Zhao 2025-04-13 17:20:26 -07:00
  • d085a44082
    Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537) Michael Goin 2025-04-13 08:55:18 -06:00
  • f49e5aff11
    [V1][Spec Decode] KV cache slots for eagle heads (#16370) Lily Liu 2025-04-12 19:42:51 -07:00
  • 6c11ecf8d3
    [Bugfix] Validate logit biases to prevent out of vocab ids crashing engine (#16529) Ryan McConville 2025-04-12 21:19:19 +01:00
  • 93e5f3c5fb
    [Perf] Optimize Preparing Inputs for GPU Model Runner (#16484) SnowCharm 2025-04-12 22:54:37 +08:00
  • 70363bccfa
    Fix syntaxWarning: invalid escape sequence '\s' (#16532) Jie Fu (傅杰) 2025-04-12 22:39:42 +08:00
  • 3cdc57669f
    [Misc] Delete redundant code (#16530) Jee Jee Li 2025-04-12 19:21:37 +08:00
  • 68bb122eb4
    [MISC] Make GroupCoordinator compatible with out-of-tree devices (#16464) Huazhong Ji 2025-04-12 17:20:25 +08:00
  • d9fc8cd9da
    [V1] Enable multi-input by default (#15799) Cyrus Leung 2025-04-12 16:52:39 +08:00
  • f069f3ea74
    [Misc] Openai transcription client example use same Whisper model (#16487) Nicolò Lucchesi 2025-04-12 09:27:03 +02:00
  • c5bc0e7fcc
    [Misc] Update chat utils tests (#16520) Cyrus Leung 2025-04-12 14:48:43 +08:00
  • 4a3a518722
    fix: spelling (#16466) Tianer Zhou 2025-04-12 14:24:22 +08:00
  • fbf722c6e6
    [Frontend] support matryoshka representation / support embedding API dimensions (#16331) wang.yuqi 2025-04-12 14:23:10 +08:00
  • e92d7085bf
    [Feature][V1] Add xgrammar to support minLength, maxLength with test (#16516) leon-seidel 2025-04-12 08:22:07 +02:00