vllm/docs/source/features/quantization/gguf.md

(gguf)=

# GGUF

```{warning}
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
```

```{warning}
Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
```

To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

```console
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

```console
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```

```{warning}
We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
```

You can also use the GGUF model directly through the LLM entrypoint:

```python
from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
   {
      "role": "system",
      "content": "You are a helpful assistant"
   },
   {
      "role": "user",
      "content": "Hello"
   },
   {
      "role": "assistant",
      "content": "Hello! How can I assist you today?"
   },
   {
      "role": "user",
      "content": "Write an essay about the importance of higher education.",
   },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
         tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)

# Print the outputs.
for output in outputs:
   prompt = output.prompt
   generated_text = output.outputs[0].text
   print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`(gguf)=`

			`# GGUF`

			```{warning}
			`Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.`
			```

			```{warning}
			`Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.`
			```

			`To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:`

			```console
[CI/Build] Add markdown linter (#11857) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2025-01-12 03:17:13 -05:00			`wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf`
			`# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.`
			`vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			```

			You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

			```console
[CI/Build] Add markdown linter (#11857) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2025-01-12 03:17:13 -05:00			`# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.`
			`vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			```

			```{warning}
			`We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.`
			```

			`You can also use the GGUF model directly through the LLM entrypoint:`

			```python
			`from vllm import LLM, SamplingParams`

			`# In this script, we demonstrate how to pass input to the chat method:`
			`conversation = [`
			`{`
			`"role": "system",`
			`"content": "You are a helpful assistant"`
			`},`
			`{`
			`"role": "user",`
			`"content": "Hello"`
			`},`
			`{`
			`"role": "assistant",`
			`"content": "Hello! How can I assist you today?"`
			`},`
			`{`
			`"role": "user",`
			`"content": "Write an essay about the importance of higher education.",`
			`},`
			`]`

			`# Create a sampling params object.`
			`sampling_params = SamplingParams(temperature=0.8, top_p=0.95)`

			`# Create an LLM.`
			`llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",`
			`tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")`
			`# Generate texts from the prompts. The output is a list of RequestOutput objects`
			`# that contain the prompt, generated text, and other information.`
			`outputs = llm.chat(conversation, sampling_params)`

			`# Print the outputs.`
			`for output in outputs:`
			`prompt = output.prompt`
			`generated_text = output.outputs[0].text`
			`print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")`
			```