(runai-model-streamer)=
# Loading models with Run:ai Model Streamer
Run:ai Model Streamer is a library for reading tensors concurrently and streaming them to GPU memory.
Further reading can be found in the [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer.
You first need to install vLLM's RunAI optional dependency:
```console
pip3 install vllm[runai]
```
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
```console
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer
```
To run a model from an AWS S3 object store, run:
```console
vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
```
To run a model from an S3-compatible object store, run:
```console
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
```
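For offline loading from an S3-compatible store, the same environment variables apply. A sketch that sets them in-process, assuming the streamer's S3 client reads them at initialization:

```python
import os

# Assumption: these must be set before the streamer opens its S3 client.
os.environ["RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING"] = "0"
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"
os.environ["AWS_ENDPOINT_URL"] = "https://storage.googleapis.com"

from vllm import LLM

llm = LLM(model="s3://core-llm/Llama-3-8b", load_format="runai_streamer")
```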
## Tunable parameters
You can tune parameters using `--model-loader-extra-config`:
You can tune `concurrency`, which controls the level of concurrency and the number of OS threads reading tensors from the file to the CPU buffer.
When reading from S3, it is the number of client instances the host opens to the S3 server.
```console
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
```
You can also control the size of the CPU memory buffer into which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
```console
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
```
:::{note}
2024-12-23 17:35:38 -05:00
For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation ](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md ).
2025-01-29 03:38:29 +00:00
:::