
==============
Profiling vLLM
==============

We support tracing vLLM workers using the ``torch.profiler`` module. You can enable tracing by setting the ``VLLM_TORCH_PROFILER_DIR`` environment variable to the directory where you want to save the traces: ``VLLM_TORCH_PROFILER_DIR=/mnt/traces/``

The OpenAI server also needs to be started with the ``VLLM_TORCH_PROFILER_DIR`` environment variable set.

When using ``benchmarks/benchmark_serving.py``, you can enable profiling by passing the ``--profile`` flag.

.. warning::

   Only enable profiling in a development environment. 


Traces can be visualized using https://ui.perfetto.dev/.
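
Before opening a trace in the browser, it can help to check what dominates it. The following is a minimal sketch, not part of vLLM: it assumes the traces are Chrome-format JSON files (optionally gzip-compressed) whose events carry ``ph``, ``name``, and ``dur`` fields. The helper names ``load_trace_events`` and ``top_events_by_duration`` are hypothetical.

.. code-block:: python

    import gzip
    import json
    from collections import Counter
    from pathlib import Path


    def load_trace_events(path):
        """Return the event list from a (possibly gzipped) Chrome-format trace."""
        path = Path(path)
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8") as f:
            data = json.load(f)
        # The object form keeps events under "traceEvents"; the array form
        # is the list of events itself.
        return data["traceEvents"] if isinstance(data, dict) else data


    def top_events_by_duration(events, n=10):
        """Sum "dur" (microseconds) per event name over complete ("X") events."""
        totals = Counter()
        for event in events:
            if event.get("ph") == "X":
                totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
        return totals.most_common(n)


    # Hypothetical usage; the directory matches the example value above:
    #
    #     for trace in sorted(Path("/mnt/traces").glob("*.json.gz")):
    #         print(trace.name, top_events_by_duration(load_trace_events(trace), n=5))

This loads the whole file into memory, so it is only practical for the small traces recommended here.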

.. tip::

   Send only a few requests through vLLM when profiling, as the traces can get quite large. There is also no need to decompress the traces; they can be viewed directly.

.. tip::

   Stopping the profiler flushes all of the trace files to the directory, which takes time. For example, flushing about 100 requests' worth of data for a Llama 70B model takes about 10 minutes on an H100.
   Set the ``VLLM_RPC_TIMEOUT`` environment variable to a large value (in milliseconds) before you start the server, for example 30 minutes:
   ``export VLLM_RPC_TIMEOUT=1800000``
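
If the server is launched from a script, both variables can be prepared in one place. This is a minimal sketch with a hypothetical helper, ``profiling_env``. It assumes ``VLLM_RPC_TIMEOUT`` is expressed in milliseconds, which is consistent with 30 minutes being written as ``1800000``.

.. code-block:: python

    import os


    def profiling_env(trace_dir, rpc_timeout_minutes=30, base_env=None):
        """Return a copy of the environment with the profiling variables set."""
        env = dict(os.environ if base_env is None else base_env)
        env["VLLM_TORCH_PROFILER_DIR"] = str(trace_dir)
        # Assumption: the timeout is in milliseconds (30 min -> 1800000).
        env["VLLM_RPC_TIMEOUT"] = str(int(rpc_timeout_minutes * 60 * 1000))
        return env


    # Hypothetical usage with the standard library:
    #
    #     import subprocess
    #     subprocess.Popen(
    #         ["python", "-m", "vllm.entrypoints.openai.api_server",
    #          "--model", "meta-llama/Meta-Llama-3-70B"],
    #         env=profiling_env("./vllm_profile"),
    #     )

Returning a copy rather than mutating ``os.environ`` keeps the profiling settings confined to the child process.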
  
Example commands and usage:
===========================

Offline Inference:
------------------

Refer to `examples/offline_inference_with_profiler.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py>`_ for an example.
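
As a rough illustration of the pattern in that example, the sketch below wraps profiling in a context manager so the profiler is stopped, and the trace flushed, even if generation raises. It assumes the ``LLM`` object exposes ``start_profile()`` and ``stop_profile()`` methods; the ``profiled`` helper itself is hypothetical and not part of vLLM.

.. code-block:: python

    from contextlib import contextmanager


    @contextmanager
    def profiled(llm):
        """Profile everything that runs inside the ``with`` block."""
        llm.start_profile()
        try:
            yield llm
        finally:
            # Always stop, so the trace is flushed even if generation raises.
            llm.stop_profile()


    # Hypothetical usage (requires vLLM, with VLLM_TORCH_PROFILER_DIR set
    # before the LLM is constructed):
    #
    #     from vllm import LLM
    #     llm = LLM(model="facebook/opt-125m")
    #     with profiled(llm):
    #         llm.generate(["Hello, my name is"])

Keeping the ``with`` block small also keeps the trace small, in line with the tip about sending only a few requests.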


OpenAI Server:
--------------

.. code-block:: bash

    VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B 

benchmark_serving.py:

.. code-block:: bash

    python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
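
To profile a hand-picked set of requests instead of a benchmark run, the profiler can in principle be toggled over HTTP. The sketch below is an assumption-laden illustration: it presumes the server exposes ``POST /start_profile`` and ``POST /stop_profile`` routes when ``VLLM_TORCH_PROFILER_DIR`` is set, and that it listens on ``http://localhost:8000``. The helper names are hypothetical, so check the routes against your vLLM version.

.. code-block:: python

    import urllib.request


    def profile_request(base_url, action):
        """Build a POST request for the assumed start/stop profile route."""
        if action not in ("start", "stop"):
            raise ValueError(f"action must be 'start' or 'stop', got {action!r}")
        url = f"{base_url.rstrip('/')}/{action}_profile"
        return urllib.request.Request(url, data=b"", method="POST")


    def send(request, timeout_s=1800):
        """Send the request and return the HTTP status code."""
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status


    # Hypothetical usage against a locally running server:
    #
    #     send(profile_request("http://localhost:8000", "start"))
    #     ...  # issue a few completion requests here
    #     send(profile_request("http://localhost:8000", "stop"))

The long client timeout on the stop call mirrors the advice above: flushing traces can take minutes.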