(bits-and-bytes)=
# BitsAndBytes
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.
Below are the steps to utilize BitsAndBytes with vLLM.
```console
pip install "bitsandbytes>=0.45.3"
```
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.
You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
Usually, these repositories have a config.json file that includes a quantization_config section.
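If you want to check whether a checkpoint is already quantized before loading it, you can inspect its config. A minimal sketch, assuming the `transformers` package is installed (the model id is only an example):

```python
# Sketch: inspect a checkpoint's config.json for a quantization_config section.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("unsloth/tinyllama-bnb-4bit")
# Pre-quantized bitsandbytes checkpoints expose their settings here;
# unquantized models simply lack this attribute.
print(getattr(config, "quantization_config", "no quantization_config found"))
```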
## Read quantized checkpoint
For pre-quantized checkpoints, vLLM will try to infer the quantization method from the config file, so you don't need to explicitly specify the quantization argument.
```python
from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
```
## Inflight quantization: load as 4bit quantization
For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify the quantization argument.
```python
from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes")
```
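However the model is loaded, generation then works as with any other vLLM model. A minimal end-to-end sketch using the pre-quantized checkpoint from above (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams
import torch

# Illustrative prompt and sampling settings; adjust as needed.
prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="unsloth/tinyllama-bnb-4bit", dtype=torch.bfloat16)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```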
## OpenAI Compatible Server
Append the following to your model arguments for 4bit inflight quantization:
```console
--quantization bitsandbytes
```
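
For example, to serve an unquantized model with in-flight 4bit quantization (the model id is only an example):

```console
vllm serve huggyllama/llama-7b --quantization bitsandbytes
```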