.. _spec_decode:

Speculative decoding in vLLM
============================

.. warning::
    Speculative decoding in vLLM is not yet optimized and does not usually
    yield inter-token latency reductions across all prompt datasets or
    sampling parameters. The optimization work is ongoing and is tracked in
    the vLLM issue tracker.

This document shows how to use speculative decoding with vLLM.
Speculative decoding is a technique that improves inter-token latency in
memory-bound LLM inference.

Speculating with a draft model
------------------------------

The following code configures vLLM to use speculative decoding with a draft
model, speculating 5 tokens at a time.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        # Smaller draft model that proposes tokens for the target model to verify.
        speculative_model="facebook/opt-125m",
        # Number of tokens proposed per speculation step.
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Speculating by matching n-grams in the prompt
---------------------------------------------

The following code configures vLLM to use speculative decoding where
proposals are generated by matching n-grams in the prompt, so no separate
draft model is needed. For more information, read the discussion thread on
this technique.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        # The special value "[ngram]" selects n-gram matching instead of a draft model.
        speculative_model="[ngram]",
        num_speculative_tokens=5,
        # Longest n-gram to look up in the prompt.
        ngram_prompt_lookup_max=4,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Resources for vLLM contributors
-------------------------------

* A Hacker's Guide to Speculative Decoding in vLLM
* What is Lookahead Scheduling in vLLM?
* Information on batch expansion.
* Dynamic speculative decoding
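
Sketch: proposing tokens by n-gram matching
-------------------------------------------

To make the n-gram speculation described above more concrete, the following
is a minimal, framework-free sketch of how a proposal could be produced from
the token sequence seen so far. It is an illustration under stated
assumptions, not vLLM's actual implementation: the function name
``propose_by_ngram`` and its parameters are hypothetical, and ``max_ngram``
only loosely mirrors the role of ``ngram_prompt_lookup_max``. The sketch takes
the last ``n`` tokens, searches for the most recent earlier occurrence of that
n-gram, and proposes the tokens that followed it, preferring longer n-grams.

.. code-block:: python

    from typing import List, Optional


    def propose_by_ngram(
        token_ids: List[int],
        max_ngram: int,
        num_speculative_tokens: int,
        min_ngram: int = 1,
    ) -> Optional[List[int]]:
        """Hypothetical helper: propose tokens by matching the trailing n-gram.

        Returns up to ``num_speculative_tokens`` tokens copied from the text that
        followed an earlier occurrence of the trailing n-gram, or ``None`` if no
        n-gram of length ``min_ngram..max_ngram`` occurs earlier in the sequence.
        """
        # Try the longest n-gram first; a longer match is a stronger signal.
        longest = min(max_ngram, len(token_ids) - 1)
        for n in range(longest, min_ngram - 1, -1):
            suffix = token_ids[-n:]
            # Scan earlier windows from most recent to oldest. The last valid
            # start leaves at least one token after the window to propose.
            for start in range(len(token_ids) - n - 1, -1, -1):
                if token_ids[start:start + n] == suffix:
                    end = start + n + num_speculative_tokens
                    return token_ids[start + n:end]
        return None


    # The sequence ends in [1, 2, 3], which also appears at the start,
    # where it was followed by 4 and 5.
    proposal = propose_by_ngram([1, 2, 3, 4, 5, 1, 2, 3], max_ngram=3,
                                num_speculative_tokens=2)
    print(proposal)  # [4, 5]

In a real system the target model would still verify every proposed token, so
a poor proposal costs some wasted computation but does not change the output
distribution.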