(pooling-models)=

# Pooling Models

vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmModelForPooling` interface. These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input before returning them.

We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models because those features only operate on the generation or decode stage, so performance may not improve as much.

## Offline Inference

The {class}`~vllm.LLM` class provides various methods for offline inference. See Engine Arguments for a list of options when initializing the model.

For pooling models, we support the following {code}`task` options:

- Embedding ({code}`"embed"` / {code}`"embedding"`)
- Classification ({code}`"classify"`)
- Sentence Pair Scoring ({code}`"score"`)
- Reward Modeling ({code}`"reward"`)

The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:

- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
- Reward Modeling: Extract all of the hidden states and return them directly.

When loading Sentence Transformers models, we attempt to override the default pooler based on the model's Sentence Transformers configuration file ({code}`modules.json`).

You can customize the model's pooling method via the {code}`override_pooler_config` option, which takes priority over both the model's defaults and the Sentence Transformers configuration, as sketched below.
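
For example, here is a minimal sketch of overriding the pooling method at load time. It assumes a {class}`~vllm.config.PoolerConfig` class with `pooling_type` and `normalize` fields; the exact class location and field names may differ between vLLM versions.

```python
from vllm import LLM
from vllm.config import PoolerConfig

# Assumption: PoolerConfig exposes pooling_type/normalize fields;
# verify these names against your vLLM version before relying on them.
llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",
    task="embed",
    override_pooler_config=PoolerConfig(pooling_type="MEAN", normalize=False),
)
```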

### `LLM.encode`

The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM. It returns the extracted hidden states directly, which is useful for reward models.

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
print(f"Data: {data!r}")

### `LLM.embed`

The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt. It is primarily designed for embedding models.

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")

A code example can be found in `examples/offline_inference_embedding.py`.

### `LLM.classify`

The {class}`~vllm.LLM.classify` method outputs a probability vector for each prompt. It is primarily designed for classification models.

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")

A code example can be found in `examples/offline_inference_classification.py`.

### `LLM.score`

The {class}`~vllm.LLM.score` method outputs similarity scores between sentence pairs. It is primarily designed for cross-encoder models. These types of models serve as rerankers between candidate query-document pairs in RAG systems.

```{note}
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
To handle RAG at a higher level, you should use integration frameworks such as [LangChain](https://github.com/langchain-ai/langchain).
```

```python
from vllm import LLM

# Load a cross-encoder model and score a query-document pair.
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
(output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")

score = output.outputs.score
print(f"Score: {score}")
```

A code example can be found in `examples/offline_inference_scoring.py`.

## Online Inference

Our OpenAI Compatible Server can be used for online inference; see its documentation for details on how to launch the server.

### Embeddings API

Our Embeddings API is similar to {class}`~vllm.LLM.embed`, accepting both text and multi-modal inputs.

The text-only API is compatible with the OpenAI Embeddings API, so you can use the OpenAI client to interact with it. A code example can be found in `examples/openai_embedding_client.py`.
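
For instance, here is a minimal sketch using the official `openai` Python client, assuming a vLLM server is already running locally on port 8000 and serving `intfloat/e5-mistral-7b-instruct`:

```python
from openai import OpenAI

# Assumes a server launched beforehand, e.g.:
#   vllm serve intfloat/e5-mistral-7b-instruct --task embed
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",
    input="Hello, my name is",
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
```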

The multi-modal API is an extension of the OpenAI Embeddings API that incorporates the OpenAI Chat Completions API, so it is not part of the OpenAI standard. Please see the multi-modal inputs documentation for more details on how to use it.

### Score API

Our Score API is similar to {class}`~vllm.LLM.score`. Please see the Score API documentation for more details on how to use it.
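
As an illustration, here is a minimal sketch that calls the server over plain HTTP. The `/score` route and the `text_1`/`text_2` request fields are assumptions about the request schema and may differ between vLLM versions, so verify them against your server's documentation:

```python
import requests

# Assumed endpoint and request schema; verify against your vLLM version.
response = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "text_1": "What is the capital of France?",
        "text_2": "The capital of Brazil is Brasilia.",
    },
)
print(response.json())  # response body includes a score for the pair
```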