[v1][Metrics] Add design doc (#12745)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
This commit is contained in:
parent
550c7ba3dc
commit
c2bd2196fc
BIN
docs/source/assets/design/v1/metrics/intervals-1.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 185 KiB |
BIN
docs/source/assets/design/v1/metrics/intervals-2.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 162 KiB |
BIN
docs/source/assets/design/v1/metrics/intervals-3.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 161 KiB |
712
docs/source/design/v1/metrics.md
Normal file
@ -0,0 +1,712 @@
# Metrics

Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.

## Objectives

- Achieve parity of metrics between v0 and v1.
- The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
- Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
## Background

Metrics in vLLM can be categorized as follows:

1. Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
2. Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLOs that an SRE monitoring vLLM will be tracking.

The mental model is that the "Server-level Metrics" explain why the "Request-level Metrics" are what they are.
### v0 Metrics

In v0, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:

- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_swapped` (Gauge)
- `vllm:num_requests_waiting` (Gauge)
- `vllm:gpu_cache_usage_perc` (Gauge)
- `vllm:cpu_cache_usage_perc` (Gauge)
- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
- `vllm:prompt_tokens_total` (Counter)
- `vllm:generation_tokens_total` (Counter)
- `vllm:request_success_total` (Counter)
- `vllm:request_prompt_tokens` (Histogram)
- `vllm:request_generation_tokens` (Histogram)
- `vllm:time_to_first_token_seconds` (Histogram)
- `vllm:time_per_output_token_seconds` (Histogram)
- `vllm:e2e_request_latency_seconds` (Histogram)
- `vllm:request_queue_time_seconds` (Histogram)
- `vllm:request_inference_time_seconds` (Histogram)
- `vllm:request_prefill_time_seconds` (Histogram)
- `vllm:request_decode_time_seconds` (Histogram)
- `vllm:request_max_num_generation_tokens` (Histogram)
- `vllm:num_preemptions_total` (Counter)
- `vllm:cache_config_info` (Gauge)
- `vllm:lora_requests_info` (Gauge)
- `vllm:tokens_total` (Counter)
- `vllm:iteration_tokens_total` (Histogram)
- `vllm:time_in_queue_requests` (Histogram)
- `vllm:model_forward_time_milliseconds` (Histogram)
- `vllm:model_execute_time_milliseconds` (Histogram)
- `vllm:request_params_n` (Histogram)
- `vllm:request_params_max_tokens` (Histogram)
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)

These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
### Grafana Dashboard

vLLM also provides [a reference example](https://docs.vllm.ai/en/latest/getting_started/examples/prometheus_grafana.html) for how to collect and store these metrics using Prometheus and visualize them using a Grafana dashboard.

The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:

- `vllm:e2e_request_latency_seconds_bucket` - End-to-end request latency measured in seconds.
- `vllm:prompt_tokens_total` - Prompt Tokens/Sec.
- `vllm:generation_tokens_total` - Generation Tokens/Sec.
- `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
- `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
- `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
- `vllm:gpu_cache_usage_perc` - Percentage of cache blocks used by vLLM.
- `vllm:request_prompt_tokens` - Request prompt length.
- `vllm:request_generation_tokens` - Request generation length.
- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
- `vllm:request_queue_time_seconds` - Queue time.
- `vllm:request_prefill_time_seconds` - Request prefill time.
- `vllm:request_decode_time_seconds` - Request decode time.
- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.

See [the PR which added this dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
### Prometheus Client Library

Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.

### Multi-process Mode

In v0, metrics are collected in the engine core process and we use multi-process mode to make them available in the API server process. See <gh-pr:7279>.
### Built in Python/Process Metrics

The following metrics are supported by default by `prometheus_client`, but they are not exposed when multiprocess mode is used:

- `python_gc_objects_collected_total`
- `python_gc_objects_uncollectable_total`
- `python_gc_collections_total`
- `python_info`
- `process_virtual_memory_bytes`
- `process_resident_memory_bytes`
- `process_start_time_seconds`
- `process_cpu_seconds_total`
- `process_open_fds`
- `process_max_fds`

This is relevant because if we move away from multiprocess mode in v1, we get these back. However, it's questionable how relevant they are if they don't aggregate these stats across all of the processes that make up a vLLM instance.
### v0 PRs and Issues

For background, these are some of the relevant PRs which added the v0 metrics:

- <gh-pr:1890>
- <gh-pr:2316>
- <gh-pr:2730>
- <gh-pr:4464>
- <gh-pr:7279>

Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
## v1 Design

### v1 PRs

For background, here are the relevant v1 PRs relating to the v1 metrics issue <gh-issue:10582>:

- <gh-pr:11962>
- <gh-pr:11973>
- <gh-pr:10907>
- <gh-pr:12416>
- <gh-pr:12478>
- <gh-pr:12516>
- <gh-pr:12530>
- <gh-pr:12561>
- <gh-pr:12579>
- <gh-pr:12592>
- <gh-pr:12644>
### Metrics Collection

In v1, we wish to move computation and overhead out of the engine core process to minimize the time between each forward pass.

The overall idea of the V1 EngineCore design is:

- EngineCore is the inner loop. Performance is most critical here.
- AsyncLLM is the outer loop. This is (ideally) overlapped with GPU execution, so this is where any "overheads" should live if possible. `AsyncLLM.output_handler_loop` is therefore the ideal place for the metrics bookkeeping.

We will achieve this by collecting metrics in the frontend API server, and base these metrics on information we can glean from the `EngineCoreOutputs` returned by the engine core process to the frontend.
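As a rough sketch (the names here are hypothetical, not the actual vLLM code), the intent is that the frontend's output handler loop does this bookkeeping as it drains outputs from the engine core:

```python
# Illustrative sketch only: metrics bookkeeping happens in the frontend's
# output handling loop, which is overlapped with GPU execution, rather than
# in the engine core's inner loop.
async def output_handler_loop(engine_core, stat_loggers):
    while True:
        # One EngineCoreOutputs object per engine core iteration.
        outputs = await engine_core.get_output_async()
        for stat_logger in stat_loggers:   # e.g. logging and Prometheus
            stat_logger.record(outputs)    # cheap, per-iteration bookkeeping
```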
### Interval Calculations

Many of our metrics are the time interval between various events in the processing of a request. It is best practice to use timestamps based on "monotonic time" (`time.monotonic()`) rather than "wall-clock time" (`time.time()`) to calculate intervals, as the former is unaffected by system clock changes (e.g. from NTP).

It's also important to note that monotonic clocks differ between processes - each process has its own reference point. So it is meaningless to compare monotonic timestamps from different processes.

Therefore, in order to calculate an interval, we must compare two monotonic timestamps from the same process.
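A minimal sketch of this rule, using only the standard library:

```python
import time

# Take both timestamps with time.monotonic() in the same process. Wall-clock
# time (time.time()) is avoided because it can jump (e.g. NTP adjustments),
# and monotonic timestamps from different processes share no reference point.
queued_ts = time.monotonic()
# ... the request waits in the queue and is then scheduled ...
scheduled_ts = time.monotonic()

queue_interval = scheduled_ts - queued_ts  # seconds, robust to clock changes
```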
### Scheduler Stats

The engine core process will collect some key statistics from the scheduler - e.g. the number of requests that were scheduled or waiting after the last scheduler pass - and include those statistics in `EngineCoreOutputs`.

### Engine Core Events

The engine core will also record the timestamp of certain per-request events so that the frontend can calculate the interval between these events.
The events are:

- `QUEUED` - when the request was received by the engine core and added to the scheduler queue.
- `SCHEDULED` - when the request was first scheduled for execution.
- `PREEMPTED` - the request has been put back in the waiting queue in order to make room for other requests to complete. It will be re-scheduled in the future and will restart its prefill phase.
- `NEW_TOKENS` - when the output included in `EngineCoreOutput` was generated. Since this is common to all requests in a given iteration, we use a single timestamp on `EngineCoreOutputs` to record this event.

And the calculated intervals are:

- Queue interval - between `QUEUED` and the most recent `SCHEDULED`.
- Prefill interval - between the most recent `SCHEDULED` and the subsequent first `NEW_TOKENS`.
- Decode interval - between the first (after the most recent `SCHEDULED`) and last `NEW_TOKENS`.
- Inference interval - between the most recent `SCHEDULED` and the last `NEW_TOKENS`.
- Inter-token interval - between successive `NEW_TOKENS`.

Put another way:

:::{image} /assets/design/v1/metrics/intervals-1.png
:alt: Interval calculations - common case
:::
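A small sketch of how these intervals fall out of the recorded event timestamps, using made-up values (all timestamps taken with `time.monotonic()` in the engine core process):

```python
# Hypothetical per-request event timestamps recorded by the engine core.
queued_ts = 100.0                             # QUEUED
scheduled_ts = 100.5                          # most recent SCHEDULED
new_token_ts = [100.9, 101.0, 101.1, 101.3]   # NEW_TOKENS, one per iteration

queue_interval = scheduled_ts - queued_ts             # 0.5s
prefill_interval = new_token_ts[0] - scheduled_ts     # 0.4s
decode_interval = new_token_ts[-1] - new_token_ts[0]  # 0.4s
inference_interval = new_token_ts[-1] - scheduled_ts  # 0.8s
inter_token_intervals = [
    later - earlier for earlier, later in zip(new_token_ts, new_token_ts[1:])
]
```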
We explored the possibility of having the frontend calculate these intervals using the timing of events visible to the frontend. However, the frontend does not have visibility into the timing of the `QUEUED` and `SCHEDULED` events and, since we need to calculate intervals based on monotonic timestamps from the same process, we need the engine core to record timestamps for all of these events.

#### Interval Calculations vs Preemptions

When a preemption occurs during decode, since any already generated tokens are reused, we consider the preemption as affecting the inter-token, decode, and inference intervals.

:::{image} /assets/design/v1/metrics/intervals-2.png
:alt: Interval calculations - preempted decode
:::

When a preemption occurs during prefill (assuming such an event is possible), we consider the preemption as affecting the time-to-first-token and prefill intervals.

:::{image} /assets/design/v1/metrics/intervals-3.png
:alt: Interval calculations - preempted prefill
:::
### Frontend Stats Collection

As the frontend processes a single `EngineCoreOutputs` - i.e. the output from a single engine core iteration - it collects various statistics relating to that iteration:

- The total number of new tokens generated in this iteration.
- The total number of prompt tokens processed by the prefills that completed in this iteration.
- The queue intervals for any requests that were scheduled in this iteration.
- The prefill intervals for any requests that completed prefill in this iteration.
- The inter-token intervals (Time Per Output Token, TPOT) for all requests included in this iteration.
- The Time-To-First-Token (TTFT) for any requests that completed prefill in this iteration. However, we calculate this interval relative to when the request was first received by the frontend (`arrival_time`) in order to account for input processing time.

For any requests that were completed in a given iteration, we also record:

- The inference and decode intervals - relative to the scheduled and first-token events, as described above.
- End-to-end latency - the interval between the frontend `arrival_time` and the frontend receiving the final token.
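As a sketch, the per-iteration record built by the frontend might look something like this (the field names are illustrative, not vLLM's actual data structures):

```python
from dataclasses import dataclass, field

@dataclass
class IterationStats:
    """Illustrative per-iteration stats gathered from one EngineCoreOutputs."""
    num_generation_tokens: int = 0   # new tokens generated in this iteration
    num_prompt_tokens: int = 0       # from prefills completed in this iteration
    queue_intervals: list[float] = field(default_factory=list)
    prefill_intervals: list[float] = field(default_factory=list)
    inter_token_intervals: list[float] = field(default_factory=list)
    ttfts: list[float] = field(default_factory=list)  # relative to arrival_time
    # Recorded only for requests that finished in this iteration:
    inference_intervals: list[float] = field(default_factory=list)
    decode_intervals: list[float] = field(default_factory=list)
    e2e_latencies: list[float] = field(default_factory=list)
```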
### Metrics Publishing - Logging

The `LoggingStatLogger` metrics publisher outputs a log `INFO` message every 5 seconds with some key metrics:

- The current number of running/waiting requests
- The current GPU cache usage
- The number of prompt tokens processed per second over the past 5 seconds
- The number of new tokens generated per second over the past 5 seconds
- The prefix cache hit rate over the most recent 1k kv-cache block queries
### Metrics Publishing - Prometheus

The `PrometheusStatLogger` metrics publisher makes the metrics available via a `/metrics` HTTP endpoint in a Prometheus-compatible format. A Prometheus instance can then be configured to poll this endpoint (e.g. every second) and record the values in its time-series database. Prometheus is often used via Grafana, allowing these metrics to be graphed over time.

Prometheus supports the following metric types:

- Counter: a value that will increase over time, never reducing, and generally reset to zero when the vLLM instance restarts. For example, the number of tokens generated over the lifetime of the instance.
- Gauge: a value that goes up and down, for example the number of requests currently scheduled for execution.
- Histogram: a count of metric samples, recorded in buckets. For example, the number of requests whose TTFT was <1ms, <5ms, <10ms, <20ms, and so on.

Prometheus metrics can also be labelled, allowing metrics to be combined according to matching labels. In vLLM, we add a `model_name` label to every metric which includes the name of the model served by that instance.
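As a sketch, labelled metrics of each type might be declared with `prometheus_client` like this (the names and buckets mirror the example output below, but this is illustrative rather than vLLM's actual metric definitions):

```python
from prometheus_client import Counter, Gauge, Histogram

labelnames = ["model_name"]
model = "meta-llama/Llama-3.1-8B-Instruct"

num_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests in model execution batches.",
    labelnames=labelnames)
generation_tokens = Counter(
    "vllm:generation_tokens",  # exposed with a _total suffix
    "Number of generation tokens processed.",
    labelnames=labelnames)
ttft = Histogram(
    "vllm:time_to_first_token_seconds",
    "Histogram of time to first token in seconds.",
    labelnames=labelnames,
    buckets=[0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1])

num_running.labels(model_name=model).set(8)
generation_tokens.labels(model_name=model).inc(256)
ttft.labels(model_name=model).observe(0.035)
```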
Example output:

```bash
$ curl http://0.0.0.0:8000/metrics
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8.0
...
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 27453.0
...
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B-Instruct"} 131.0
vllm:request_success_total{finished_reason="abort",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
...
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.01",model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.02",model_name="meta-llama/Llama-3.1-8B-Instruct"} 13.0
vllm:time_to_first_token_seconds_bucket{le="0.04",model_name="meta-llama/Llama-3.1-8B-Instruct"} 97.0
vllm:time_to_first_token_seconds_bucket{le="0.06",model_name="meta-llama/Llama-3.1-8B-Instruct"} 123.0
vllm:time_to_first_token_seconds_bucket{le="0.08",model_name="meta-llama/Llama-3.1-8B-Instruct"} 138.0
vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
```

Note - the choice of histogram buckets to be most useful to users across a broad set of use cases is not straightforward and will require refinement over time.
### Cache Config Info

`prometheus_client` has support for [Info metrics](https://prometheus.github.io/client_python/instrumenting/info/), which are equivalent to a `Gauge` whose value is permanently set to 1 but which expose interesting key/value pair information via labels. This is used for information about an instance that does not change - so it only needs to be observed at startup - and allows comparing across instances in Prometheus.

We use this concept for the `vllm:cache_config_info` metric:

```text
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
```

However, `prometheus_client` has [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) - for [unclear reasons](gh-pr:7279#discussion_r1710417152). We simply use a `Gauge` metric set to 1 and `multiprocess_mode="mostrecent"` instead.
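A minimal sketch of that workaround (the label set is abbreviated for illustration):

```python
from prometheus_client import Gauge

# A Gauge permanently set to 1, with the static config exposed as labels.
cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype", "gpu_memory_utilization"],
    multiprocess_mode="mostrecent")

cache_config_info.labels(
    block_size="16",
    cache_dtype="auto",
    gpu_memory_utilization="0.9").set(1)
```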
### LoRA Metrics

The `vllm:lora_requests_info` `Gauge` is somewhat similar, except the value is the current wall-clock time, and it is updated every iteration.

The label names used are:

- `running_lora_adapters`: a per-adapter count of the number of requests running using that adapter, formatted as a comma-separated string.
- `waiting_lora_adapters`: similar, except counting requests that are waiting to be scheduled.
- `max_lora`: the static "max number of LoRAs in a single batch" configuration.

Encoding running/waiting counts for multiple adapters in a comma-separated string seems quite misguided - we could use labels to distinguish between per-adapter counts, as sketched below. This should be revisited.
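For example, a purely illustrative (and not settled) labelled alternative might look like:

```python
from prometheus_client import Gauge

# Per-adapter counts distinguished by labels instead of being packed into a
# comma-separated string. Metric and label names here are hypothetical.
lora_requests = Gauge(
    "vllm:lora_requests",
    "Number of requests per LoRA adapter and state.",
    labelnames=["adapter", "state"])

lora_requests.labels(adapter="sql-lora", state="running").set(2)
lora_requests.labels(adapter="sql-lora", state="waiting").set(1)
```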
Note that `multiprocess_mode="livemostrecent"` is used - the most recent metric is used, but only from currently running processes.

This was added in <gh-pr:9477> and there is [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If we revisit this design and deprecate the old metric, we should reduce the need for a significant deprecation period by making the change in v0 also and asking this project to move to the new metric.
### Prefix Cache Metrics

The discussion in <gh-issue:10582> about adding prefix cache metrics yielded some interesting points which may be relevant to how we approach future metrics.

Every time the prefix cache is queried, we record the number of blocks queried and the number of queried blocks present in the cache (i.e. hits).

However, the metric of interest is the hit rate - i.e. the number of hits per query.

In the case of logging, we expect the user is best served by calculating the hit rate over a fixed number of the most recent queries (the interval is fixed to the 1k most recent queries for now).

In the case of Prometheus, though, we should take advantage of the time-series nature of Prometheus and allow the user to calculate the hit rate over an interval of their choosing. For example, a PromQL query to calculate the hit rate over the past 5 minutes:

```text
rate(cache_query_hit[5m]) / rate(cache_query_total[5m])
```

To achieve this, we should record the queries and hits as counters in Prometheus, rather than recording the hit rate as a gauge.
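A minimal sketch of the counter-based approach, using names based on the PromQL example above (the real metric names may differ):

```python
from prometheus_client import Counter

# prometheus_client appends a _total suffix to counters on exposition, so
# "cache_query" is exposed as cache_query_total.
cache_query_total = Counter("cache_query", "Prefix cache blocks queried.")
cache_query_hit = Counter("cache_query_hit", "Prefix cache blocks that hit.")

def record_prefix_cache_query(num_queried_blocks: int, num_hit_blocks: int) -> None:
    # Called on every prefix cache lookup; the hit rate is then derived in
    # PromQL over whatever window the user chooses.
    cache_query_total.inc(num_queried_blocks)
    cache_query_hit.inc(num_hit_blocks)
```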
## Deprecated Metrics

### How To Deprecate

Deprecating metrics shouldn't be taken lightly. Users may not notice a metric has been deprecated, and may be quite inconvenienced when it is suddenly (from their perspective) removed, even if there is an equivalent metric for them to use.

As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was [deprecated](gh-pr:2764) (with a comment in the code), [removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218).

In general:

1) We should be cautious about deprecating metrics, especially since it can be hard to predict the user impact.
2) We should include a prominent deprecation notice in the help string that is included in the `/metrics` output.
3) We should list deprecated metrics in user-facing documentation and release notes.
4) We should consider hiding deprecated metrics behind a CLI argument in order to give administrators [an escape hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics) for some time before deleting them.
### Unimplemented - `vllm:tokens_total`

Added by <gh-pr:4464>, but apparently never implemented. This can just be removed.
### Duplicated - Queue Time

The `vllm:time_in_queue_requests` Histogram metric was added by <gh-pr:9659> and its calculation is:

```python
self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time
```

Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds`, leaving us with:

```python
if seq_group.is_finished():
    if (seq_group.metrics.first_scheduled_time is not None and
            seq_group.metrics.first_token_time is not None):
        time_queue_requests.append(
            seq_group.metrics.first_scheduled_time -
            seq_group.metrics.arrival_time)
    ...
    if seq_group.metrics.time_in_queue is not None:
        time_in_queue_requests.append(
            seq_group.metrics.time_in_queue)
```

This seems duplicative, and one of them should be removed. The latter is used by the Grafana dashboard, so we should deprecate or remove the former from v0.
### Prefix Cache Hit Rate

See above - we now expose 'queries' and 'hits' counters rather than a 'hit rate' gauge.
### KV Cache Offloading

Two v0 metrics relate to a "swapped" preemption mode that is no longer relevant in v1:

- `vllm:num_requests_swapped`
- `vllm:cpu_cache_usage_perc`

In this mode, when a request is preempted (e.g. to make room in the KV cache to complete other requests), we swap KV cache blocks out to CPU memory. This is also known as "KV cache offloading" and is configured with `--swap-space` and `--preemption-mode`.

In v0, [vLLM has long supported beam search](gh-issue:6226). The `SequenceGroup` encapsulated the idea of N sequences which all shared the same prompt KV blocks. This enabled KV cache block sharing between requests, and copy-on-write to do branching. CPU swapping was intended for these beam-search-like cases.

Later, the concept of prefix caching was introduced, which allowed KV cache blocks to be shared implicitly. This proved to be a better option than CPU swapping since blocks can be evicted slowly on demand and the part of the prompt that was evicted can be recomputed.

`SequenceGroup` was removed in V1, although a replacement will be required for "parallel sampling" (`n>1`). [Beam search was moved out of the core (in V0)](gh-issue:8306). There was a lot of complex code for a very uncommon feature.

In V1, with prefix caching being better (zero overhead) and therefore on by default, the preemption and recompute strategy should work better.
## Future Work

### Parallel Sampling

Some v0 metrics are only relevant in the context of "parallel sampling". This is where the `n` parameter in a request is used to request multiple completions from the same prompt.

As part of adding parallel sampling support in <gh-pr:10980>, we should also add these metrics:

- `vllm:request_params_n` (Histogram)

  Observes the value of the 'n' parameter of every finished request.

- `vllm:request_max_num_generation_tokens` (Histogram)

  Observes the maximum output length of all sequences in every finished sequence group. In the absence of parallel sampling, this is equivalent to `vllm:request_generation_tokens`.
### Speculative Decoding

Some v0 metrics are specific to "speculative decoding". This is where we generate candidate tokens using a faster, approximate method or model and then validate those tokens with the larger model.

- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)

There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)" speculative decoding to v1. Other techniques will follow. We should revisit the v0 metrics in this context.

Note - we should probably expose the acceptance rate as separate accepted and draft counters, like we do for the prefix caching hit rate, as sketched below. Efficiency likely also needs similar treatment.
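For example, a sketch of the counter-based approach (metric names follow the v0 names above, without the `_total` suffix that `prometheus_client` appends to counters on exposition):

```python
from prometheus_client import Counter

# The acceptance rate over any window can then be derived in PromQL, e.g.:
#   rate(vllm:spec_decode_num_accepted_tokens_total[5m])
#     / rate(vllm:spec_decode_num_draft_tokens_total[5m])
accepted_tokens = Counter(
    "vllm:spec_decode_num_accepted_tokens",
    "Number of accepted speculative tokens.")
draft_tokens = Counter(
    "vllm:spec_decode_num_draft_tokens",
    "Number of draft speculative tokens.")
```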
### Autoscaling and Load-balancing

A common use case for our metrics is to support automated scaling of vLLM instances.

For related discussion from the [Kubernetes Serving Working Group](https://github.com/kubernetes/community/tree/master/wg-serving), see:

- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>.

This is a non-trivial topic. Consider this comment from Rob:

> I think this metric should focus on trying to estimate what the max
> concurrency that will cause the average request length > queries per
> second ... since this is really what will "saturate" the server.

A clear goal is that we should expose the metrics required to detect this saturation point, so administrators can implement auto-scaling rules based on those. However, in order to do so, we need to have a clear view on how an administrator (and automated monitoring system) should judge an instance as approaching saturation:

> To identify, what is the saturation point for model server compute
> (the inflection point where we cannot get more throughput with a
> higher request rate, but start to incur additional latency) so we
> can autoscale effectively?
### Metric Naming

Our approach to naming metrics probably deserves to be revisited:

1. The use of colons in metric names seems contrary to ["colons are reserved for user defined recording rules"](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
2. Most of our metrics follow the convention of ending with units, but not all do.
3. Some of our metric names end with `_total`:

   ```text
   If there is a suffix of `_total` on the metric name, it will be removed. When
   exposing the time series for counter, a `_total` suffix will be added. This is
   for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
   requires the `_total` suffix.
   ```
### Adding More Metrics

There is no shortage of ideas for new metrics:

- Examples from other projects like [TGI](https://github.com/IBM/text-generation-inference?tab=readme-ov-file#metrics)
- Proposals arising from specific use cases, like the Kubernetes auto-scaling topic above
- Proposals that might arise out of standardisation efforts like [OpenTelemetry Semantic Conventions for Gen AI](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai).

We should be cautious in our approach to adding new metrics. While metrics are often relatively straightforward to add:

1. They can be difficult to remove - see the section on deprecation above.
2. They can have a meaningful performance impact when enabled. And metrics are usually of very limited use unless they can be enabled by default and in production.
3. They have an impact on development and maintenance of the project. Every metric added to v0 has made this v1 effort more time-consuming, and perhaps not all metrics justify this ongoing investment in their maintenance.
## Tracing - OpenTelemetry

Metrics provide an aggregated view over time of the system's performance and health. Tracing, on the other hand, tracks individual requests as they move through different services and components. Both fall under the more general heading of "Observability".

v0 has support for OpenTelemetry tracing:

- Added by <gh-pr:4687>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](https://docs.vllm.ai/en/latest/getting_started/examples/opentelemetry.html)
- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)

OpenTelemetry has a [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

Since metrics is a big enough topic on its own, we are going to tackle the topic of tracing in v1 separately.
### OpenTelemetry Model Forward vs Execute Time

In v0, we have the following two metrics:

- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent in the model forward pass when this request was in the batch.
- `vllm:model_execute_time_milliseconds` (Histogram) - The time spent in the model execute function. This will include model forward, block/sync across workers, cpu-gpu sync time, and sampling time.

These metrics are only enabled when OpenTelemetry tracing is enabled and if `--collect-detailed-traces=all/model/worker` is used. The documentation for this option states:

> collect detailed traces for the specified modules. This involves
> use of possibly costly and or blocking operations and hence might
> have a performance impact.

The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace as:

```text
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)
```

We already have `inference_time` and `decode_time` metrics, so the question is whether there are sufficiently common use cases for the higher-resolution timings to justify the overhead.

Since we are going to treat the question of OpenTelemetry support separately, we will include these particular metrics under that topic.
@ -158,6 +158,7 @@ design/multiprocessing
:maxdepth: 2

design/v1/prefix_caching
design/v1/metrics
:::

% How to contribute to the vLLM project