[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM (#7962)
This commit is contained in: parent 2ee45281a5, commit 2febcf2777
@@ -161,6 +161,46 @@ A variety of speculative models of this type are available on HF hub:

* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_
Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless
guarantees of speculative decoding, breaking down the guarantees into three key areas:

1. **Theoretical Losslessness**
    - Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors
      might cause slight variations in output distributions, as discussed in
      `Accelerating Large Language Model Decoding with Speculative Sampling <https://arxiv.org/pdf/2302.01318>`_
      (a sketch of the acceptance rule follows this list).

2. **Algorithmic Losslessness**
    - vLLM's implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:

        - **Rejection Sampler Convergence**: Ensures that samples from vLLM's rejection sampler align with the target distribution.
          `View Test Code <https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252>`_

        - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling without it.
          This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
          provides a lossless guarantee. Almost all of the tests in `this directory <https://github.com/vllm-project/vllm/tree/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e>`_
          verify this property using `this assertion implementation <https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291>`_.

3. **vLLM Logprob Stability**
    - vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the same
      request across runs. For more details, see the FAQ section titled *Can the output of a prompt vary across runs in vLLM?* in the
      `FAQs <../serving/faq.rst>`_.
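
For intuition, the following is a minimal, single-token sketch of the modified rejection sampling rule from the paper linked above.
It is illustrative only: the function name, arguments, and shapes are assumptions for this sketch and do not reflect vLLM's actual
rejection sampler, which operates on batches of draft tokens.

.. code-block:: python

    # Illustrative sketch of the accept/resample rule from speculative sampling.
    # Not vLLM code: names, arguments, and shapes are assumptions.
    import torch

    def speculative_sample(p_target: torch.Tensor,
                           q_draft: torch.Tensor,
                           draft_token: int) -> int:
        """p_target and q_draft are probability vectors over the vocabulary."""
        # Accept the draft token with probability min(1, p(x) / q(x)).
        accept_prob = torch.clamp(
            p_target[draft_token] / q_draft[draft_token], max=1.0)
        if torch.rand(()) < accept_prob:
            return draft_token
        # On rejection, resample from the normalized residual max(0, p - q).
        # This correction is what makes the procedure sample exactly from
        # p_target, up to floating-point error.
        residual = torch.clamp(p_target - q_draft, min=0.0)
        residual = residual / residual.sum()
        return int(torch.multinomial(residual, num_samples=1).item())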

**Conclusion**

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.

- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
  due to non-deterministic behavior in batched operations or numerical instability.

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the `FAQs <../serving/faq.rst>`_.

Resources for vLLM contributors
-------------------------------
@@ -10,3 +10,22 @@ A: Assuming that you're referring to using OpenAI compatible server to serve mul

Q: Which model to use for offline inference embedding?

A: If you want to use an embedding model, try: https://huggingface.co/intfloat/e5-mistral-7b-instruct. In contrast, models such as Llama-3-8b and Mistral-7B-Instruct-v0.3 are generation models rather than embedding models.

----------------------------------------

Q: Can the output of a prompt vary across runs in vLLM?

A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
see the `Numerical Accuracy section <https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations>`_.

In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of
Torch operations, can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
different tokens being sampled. Once a different token is sampled, further divergence is likely.
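
As a rough illustration of this batching effect (a standalone toy example, not vLLM code; the shapes and dtype are arbitrary), the
same row of activations can produce slightly different logits when processed alone versus inside a larger batch, because different
batch sizes may be dispatched to different kernels and reduction orders:

.. code-block:: python

    # Toy example: the same "request" processed in two batch compositions.
    # Depending on hardware, dtype, and the kernels PyTorch selects, the
    # difference may be zero or a few ULPs; any nonzero difference can
    # eventually change which token gets sampled.
    import torch

    torch.manual_seed(0)
    weight = torch.randn(4096, 4096).to(torch.bfloat16)
    row = torch.randn(1, 4096).to(torch.bfloat16)
    others = torch.randn(7, 4096).to(torch.bfloat16)

    logits_alone = row @ weight                               # batch of 1
    logits_batched = (torch.cat([row, others]) @ weight)[:1]  # same row in a batch of 8

    print((logits_alone.float() - logits_batched.float()).abs().max())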

**Mitigation Strategies**

- For improved stability and reduced variance, use ``float32``. Note that this will require more memory.
- If using ``bfloat16``, switching to ``float16`` can also help.
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences
  may still occur (a configuration sketch follows this list).
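
The following is a minimal sketch of how these mitigations can be applied with the offline ``LLM`` API. The model name is a
placeholder, and argument availability may vary across vLLM versions.

.. code-block:: python

    # Sketch only: the model is a placeholder; adjust to your setup.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="facebook/opt-125m",  # placeholder model
        dtype="float32",            # more stable numerics, at the cost of extra memory
    )

    sampling_params = SamplingParams(
        temperature=0.7,
        seed=1234,  # per-request seed for more reproducible sampling at temperature > 0
    )

    outputs = llm.generate(["The future of AI is"], sampling_params)
    print(outputs[0].outputs[0].text)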