Observability KPI — Monitoring, Diagnostics and Incident Response for LLM Inference Systems

Version 1.3

Stack: vLLM + Prometheus + Grafana + DCGM Exporter + Node Exporter + LMCache

Hardware: NVIDIA RTX 4070 Super (12 GB VRAM, 363 GB/s)

ISM scope: L0 (Hardware) → L4b (GPU Workload Optimization)


1. The Golden Metrics Glossary

Four metrics define the user experience and the health of the inference system.

Metric | Full Name | LLM Phase | Description
TTFT | Time To First Token | Prefill | Reaction time. The engine processes the prompt and generates the first token. Depends on prompt length, GPU availability and, under load, queue time before admission to the batch.
ITL | Inter-Token Latency | Decode | Time between one token and the next during the decode of an in-flight request. Determines the fluidity perceived by the user. Depends on model weight and the density of the current batch. Does not see queue or prefill.
TPOT | Time Per Output Token | Per-request (total) | Average cost per token computed over the total lifetime of the request, defined as E2E / output_tokens (GuideLLM convention). Includes queue, prefill and decode. The metric most commonly used in contractual SLAs.
E2E | End-to-End Latency | Total | Total time from HTTP request to the last token. Includes queuing, prefill, decode and network overhead.

TPOT vs ITL — Which Metric for Which Purpose

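Both are per-request metrics, but they measure different segments of the request lifetime: ITL covers only the decode phase of an in-flight request, while TPOT covers the whole lifetime (queue, prefill and decode).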

At empty queue (c=1), TPOT ≈ ITL because TTFT is dominated by prefill compute alone — small and stable. Under saturated load, ITL stays flat because it only measures in-flight decode, while TPOT grows because it includes TTFT, and TTFT under saturation is dominated by queue time of requests waiting for admission to the batch.

Which to use: for operational monitoring, use ITL — it is the native metric exposed by vLLM, and it isolates decode behavior from scheduling and queue noise. For contractual SLAs, use TPOT, because it corresponds to what the end user perceives. The PromQL queries in §5 measure ITL; the Sizing KB covers in detail when to use one or the other during requirements gathering.
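The divergence between the two metrics can be reproduced with a small calculation. The sketch below is illustrative only: the helper names, the 100 ms prefill, the 20 ms decode step and the 5 s queue time are assumptions, not values measured on this system.

```python
def request_metrics(arrival, token_times):
    """Derive TTFT, mean ITL, TPOT and E2E (seconds) for one request.

    arrival: time the HTTP request was received.
    token_times: emission time of each output token, ascending.
    """
    ttft = token_times[0] - arrival
    e2e = token_times[-1] - arrival
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    tpot = e2e / len(token_times)  # E2E / output_tokens
    return {"ttft": ttft, "itl": itl, "tpot": tpot, "e2e": e2e}


def synthetic_request(queue_s, prefill_s=0.1, step_s=0.02, n_tokens=100):
    """Token timestamps for a request that waits queue_s before prefill (hypothetical timings)."""
    first = queue_s + prefill_s
    return [first + i * step_s for i in range(n_tokens)]


idle = request_metrics(0.0, synthetic_request(queue_s=0.0))
saturated = request_metrics(0.0, synthetic_request(queue_s=5.0))
print(f"idle:      ITL={idle['itl']:.4f}s  TPOT={idle['tpot']:.4f}s")
print(f"saturated: ITL={saturated['itl']:.4f}s  TPOT={saturated['tpot']:.4f}s")
```

With these assumed timings ITL stays at 20 ms in both cases, while TPOT moves from about 21 ms to about 71 ms because the queue time is folded into E2E.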

TPOT vs E2E — Same Measurement, Different Reading

TPOT and E2E are not independent metrics: they are linked by a fixed relationship.

TPOT = E2E / output_tokens

Both describe the same physical event — the lifetime of a single request — but expressed in different units. E2E reads it as a total duration (seconds), TPOT reads it as a rate (seconds per token). Knowing one plus the output length gives the other automatically.

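The reason both exist is that they answer different questions: E2E answers "how long until the complete reply is available?", while TPOT answers "how fast do tokens arrive on average?".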

The discriminant is output length. For a ~50-token reply, the user perceives “fast or slow” as a single value — E2E is the right reading. For a ~2000-token reply, the user is already reading while generation continues — TPOT becomes the meaningful metric.

SLA implication: an SLA on TPOT and an SLA on E2E are not interchangeable. TPOT commits to a sustainable token rate; E2E commits to a maximum completion time. Under queue saturation, TPOT degrades more gently than E2E for long outputs, because the queue time is spread over many tokens. For mixed workloads (chat + summarization), a dual-SLA formulation is often more appropriate than a single number — for example: “E2E p95 < X for responses ≤ 200 tokens, TPOT p95 < Y for responses > 200 tokens”.
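A dual-SLA check of this kind could be sketched as follows. The function names, the nearest-rank percentile method and the limits used in the example are assumptions chosen for illustration; X and Y remain to be agreed during requirements gathering.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile; q is an integer percentage such as 95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def check_dual_sla(requests, e2e_limit_s, tpot_limit_s, split_tokens=200):
    """requests: list of (e2e_seconds, output_tokens) tuples.

    Short replies are judged on E2E p95, long replies on TPOT p95.
    A class with no samples passes vacuously.
    """
    short_e2e = [e2e for e2e, n in requests if n <= split_tokens]
    long_tpot = [e2e / n for e2e, n in requests if n > split_tokens]
    short_ok = not short_e2e or percentile(short_e2e, 95) < e2e_limit_s
    long_ok = not long_tpot or percentile(long_tpot, 95) < tpot_limit_s
    return short_ok, long_ok


# Hypothetical sample: 20 chat replies and 20 long summaries.
sample = [(1.0, 50)] * 19 + [(4.0, 50)] + [(40.0, 2000)] * 20
print(check_dual_sla(sample, e2e_limit_s=2.0, tpot_limit_s=0.05))
```

Splitting by output length before computing percentiles keeps a few long summaries from dominating the E2E tail of the chat traffic.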

Supporting Metrics

Metric | Description | Why It Matters
Waiting Requests | Requests in queue before entering Running | Early saturation indicator. If it grows, TTFT will follow.
Running Requests | Requests currently in inference on the GPU | Shows the level of active parallelism (continuous batching).
Preempted Requests | Requests preempted by the inference engine | Direct signal of pressure on the KV cache: the engine is forcibly freeing slots.
GPU KV Cache Usage | Percentage of KV cache allocated | At ~95-100%, vLLM starts preemption or rejects requests.
GPU Utilization | GPU compute utilization (DCGM) | Distinguishes whether the GPU is the bottleneck or the problem is elsewhere.
GPU Memory Used | VRAM occupied (DCGM) | Baseline to understand the margin after model loading.
LMCache Hit Rate | Ratio of cache hits to total queries (external prefix cache) | Indicates KV cache offloading effectiveness. Low hit rate = the cache is not serving, investigate prompt patterns.
Observed Throughput | Effective tokens/sec observed (prompt + generation separated) | Real end-to-end throughput, includes scheduling and queuing overhead.
Compute Throughput | Pure compute tokens/sec (prefill and decode separated) | Raw GPU throughput, isolated from overhead. The delta with Observed Throughput reveals the system overhead.

Per-Request vs System Metrics

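A distinction that proves useful during diagnostics: per-request metrics (TTFT, ITL, TPOT, E2E) are measured once per request and read as a distribution, while system metrics (Waiting Requests, Running Requests, KV Cache Usage, throughput) describe the state of the whole engine at a point in time.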

When an SLO says “TTFT p95 < 1s”, it means that the 95th percentile of the TTFT distribution measured across all requests in the observation period must stay below 1 second.


2. Observed vs Compute Throughput: Two Measures, Two Questions

The system exposes two classes of throughput metrics that answer different questions.

Prefill and Decode: Why Two Measures Are Needed

Prefill (prompt processing) is compute-bound: the GPU executes a parallel forward pass on all input tokens. Decode (output generation) is memory-bandwidth-bound: at each step the GPU reads model weights from VRAM to produce a single token. Monitoring them as an aggregate hides the nature of the bottleneck.
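The memory-bandwidth limit on decode can be estimated with simple arithmetic. The sketch below assumes every decode step re-reads all model weights once and ignores KV cache reads, batching and kernel overhead, so it gives an upper bound for a single stream, not a prediction. The function name and the bytes-per-parameter figures are illustrative.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, params_billion, bytes_per_param):
    """Upper bound on single-stream decode rate when each step reads all weights once."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB
    return bandwidth_gb_s / weight_gb


# 363 GB/s is the bandwidth figure used in this document.
fp16 = decode_ceiling_tokens_per_s(363, 7, 2.0)  # 7B at FP16, ~14 GB of weights
int4 = decode_ceiling_tokens_per_s(363, 7, 0.5)  # 7B at 4-bit, ~3.5 GB of weights
print(f"FP16 ceiling: {fp16:.1f} tok/s, 4-bit ceiling: {int4:.1f} tok/s")
```

Under these assumptions the ceiling is roughly 26 tokens/sec at FP16 and roughly 104 tokens/sec at 4-bit, which is why quantization acts directly on decode speed.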

Observed Throughput (System Throughput)

Tokens processed or generated per unit of wall-clock time. Includes everything: GPU compute, scheduling, queue time, preemption/swap, CPU-side tokenization, PCIe transfers, idle time between batch iterations.

Answers the question: how many tokens/sec is the system producing right now?

Compute Throughput (Engine Throughput)

Tokens processed or generated per unit of effective GPU compute time. Excludes any orchestration overhead.

Answers the question: when the GPU is working, how fast does it work?

Efficiency Gap

Under ideal conditions (no queue, no overhead), Observed ≈ Compute. Under load they diverge:

Efficiency Gap = 1 - (Observed Throughput / Compute Throughput)
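As a minimal sketch, the formula can be wrapped in a helper that also handles the idle case, where Compute Throughput has no usable value. The function name and the choice to return None when idle are assumptions.

```python
import math


def efficiency_gap(observed_tps, compute_tps):
    """Efficiency Gap = 1 - (Observed / Compute); None when compute throughput is unusable."""
    if math.isnan(compute_tps) or compute_tps <= 0:
        return None  # idle system: no compute time in the window
    return 1.0 - observed_tps / compute_tps


print(efficiency_gap(900.0, 1000.0))  # about 0.1: 10% of potential throughput lost to overhead
print(efficiency_gap(0.0, float("nan")))  # None: nothing to measure at rest
```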

Sources of the Overhead (the Delta)

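The delta between Observed and Compute captures the sum of the non-compute components that Observed Throughput includes: scheduling, queue time, preemption/swap, CPU-side tokenization, PCIe transfers and idle time between batch iterations.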

On a single GPU node with controlled benchmark loads, the gap is typically negligible. In production with thousands of concurrent requests and frequent swap, it can become significant.

Diagnostics: TTFT × Dual Throughput

TTFT | Compute Throughput | Diagnosis
Rises | Stable | Orchestration problem (queue, scheduling). Check Waiting Requests and Efficiency Gap.
Rises | Drops | GPU problem (compute saturation). Check DCGM_FI_DEV_GPU_UTIL.
Stable | Stable | Healthy system.
Stable | Drops | Anomaly: possible thermal throttling or GPU error. Check temperature and DCGM logs.
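The table above can be encoded as a lookup, for example to annotate alerts automatically. This is a sketch: the 20% tolerance that separates "stable" from "rises" or "drops" is a hypothetical value, and the function names are not part of any existing tool.

```python
DIAGNOSES = {
    ("rises", "stable"): "Orchestration problem (queue, scheduling)",
    ("rises", "drops"): "GPU problem (compute saturation)",
    ("stable", "stable"): "Healthy system",
    ("stable", "drops"): "Anomaly: possible thermal throttling or GPU error",
}


def trend(before, after, tolerance=0.2):
    """Classify the relative change between two readings; before must be > 0."""
    change = (after - before) / before
    if change > tolerance:
        return "rises"
    if change < -tolerance:
        return "drops"
    return "stable"


def diagnose(ttft_before, ttft_after, compute_before, compute_after):
    """Look up the diagnosis for a TTFT trend and a Compute Throughput trend."""
    key = (trend(ttft_before, ttft_after), trend(compute_before, compute_after))
    return DIAGNOSES.get(key, "Combination not covered by the table")


print(diagnose(0.2, 1.5, 1000.0, 980.0))
```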

Operational Note

Compute Throughput metrics produce values only when there are active requests. At rest, the rate() terms in the numerator and denominator are both 0, so the division yields NaN (0/0) and the panel shows NaN or no data. This is expected behavior, not an error.


3. Applied Statistics: Percentiles in Inference Systems

Never look only at the average: percentiles tell different stories.

p50 (Median)

The central value. Tells the story of the “typical” user. It is the first indicator to normalize when a problem ends. If p50 is good, most users are satisfied.

p99 (Edge Case)

The worst 1% of requests. Reveals bottlenecks, queues and saturation problems that the median hides. It is the KPI that matters for production SLAs.

The Echo Effect

p99 stays high in graphs even after a problem ends because slow data points remain in the temporal aggregation window until they exit the calculation. If the window is 5 minutes, p99 can take up to 5 minutes after the last slow request to normalize.

How to distinguish it from a real problem: if p50 is already back to normal but p99 stays high, it is almost certainly an echo effect. Wait for the window to drain.
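The echo effect can be reproduced with a small simulation. The sketch below computes a nearest-rank percentile over raw samples in a sliding window, which is a simplification of what Prometheus does with histogram buckets; the traffic pattern (one request per second, a 20-second incident with 30 s latencies) is invented for illustration.

```python
import math


def window_percentile(samples, now, window_s, q):
    """Nearest-rank percentile (q as integer %) over samples with now - window_s < ts <= now."""
    recent = sorted(lat for ts, lat in samples if now - window_s < ts <= now)
    if not recent:
        return None
    return recent[math.ceil(q * len(recent) / 100) - 1]


# One request per second for 10 minutes; requests sent at t=100..119 are slow.
samples = [(t, 30.0 if 100 <= t < 120 else 0.15) for t in range(600)]

for now in (200, 415, 420):
    p50 = window_percentile(samples, now, 300, 50)
    p99 = window_percentile(samples, now, 300, 99)
    print(f"t={now}s  p50={p50}s  p99={p99}s")
```

In this simulation the incident ends at t=120, p50 never leaves 0.15 s, and p99 stays at 30 s until enough slow samples have left the 300-second window.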

Counter Reset vs Echo Effect

A vertical drop in the p99 line can mean two things:

  1. Statistical flush: slow data points have exited the window — the system is back to healthy.
  2. vLLM restart: Prometheus counters were zeroed by a process restart.

Verification query:

resets(vllm:time_to_first_token_seconds_count[5m])

If the result is > 0, there has been a restart. Annotate it to avoid confusing it with an organic recovery.
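For intuition, resets() counts every decrease between consecutive samples of a counter within the range. A minimal Python equivalent, with made-up sample values, looks like this:

```python
def count_resets(samples):
    """Count decreases between consecutive counter samples, as resets() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical scrapes of a request counter; the drop from 40 to 3 is a process restart.
print(count_resets([10, 25, 40, 3, 18]))
print(count_resets([10, 25, 40, 55]))
```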


4. Diagnostic Tree: From Metric to Root Cause

Operational table for incident response. Cross-reference symptoms to identify the cause.

Symptom | Probable Cause | Layer | Immediate Action
High ITL, low TTFT | Slow GPU on decode. Model too heavy or insufficient quantization. | L1 | Check GPU utilization (DCGM). Consider more aggressive quantization (FP16 → AWQ/GPTQ 4-bit). Lower max_model_len.
Very high TTFT, low ITL | Queue saturation. Requests wait in Waiting too long. | L3 | Check waiting_requests. Reduce concurrent load. Check KV cache usage: if at 100%, prefill is blocked.
High p99, low p50 | Healthy system, but outliers (very long prompts, past queue) pollute statistics. | | Check if echo effect (§3). Check prompt length distribution.
High TTFT, high ITL | Total saturation. GPU at max, queue full, no margin. | L1+L3 | Reduce load immediately. Verify whether the model is appropriate for the hardware.
KV Cache > 95% | Risk of preemption or request rejection. Insufficient memory. | L3 | Lower max_model_len, reduce concurrency, evaluate smaller model or quantization. Check vllm:num_requests_preempted to confirm preemption is active.
GPU Util < 30% with high ITL | Memory bandwidth bottleneck (memory-bound). | L1 | RTX 4070 Super has 363 GB/s; for a 7B FP16 model (~14 GB), decode is memory-bound. Quantizing reduces the data volume to read.
Preempted > 0 growing | KV cache under active pressure. The engine is evicting requests. | L3 | Check KV cache usage. Reduce concurrency or max_model_len. If preemption policy is recompute, TTFT of evicted requests will rise.
Low LMCache Hit Rate (< 0.3) with high KV Cache | The external cache is not serving. Prompts have insufficient prefix overlap. | L3 | Check prompt patterns: if very different from each other, prefix cache is ineffective. Evaluate whether LMCache is configured correctly.
Observed Throughput ≪ Compute Throughput | High system overhead (scheduling, queuing, preemption). | L3+L4a | The GPU has free capacity but the system is not using it. Check waiting requests, preemption, continuous batching configuration.

Note on TPOT vs ITL diagnostics: if an external benchmark report (e.g. GuideLLM) shows TPOT rising under load while ITL in the Prometheus dashboards stays stable, this is not an inconsistency: ITL measures in-flight decode, TPOT includes the queue time accumulated in TTFT. The divergence is the expected symptom of queue saturation — use the tree above starting from TTFT and Waiting Requests.


5. Operational PromQL Queries for Grafana

Configuration Principles

The current dashboard uses the following conventions:

Golden Metrics Dashboard

Important on panel labels: the native metric exposed by vLLM is inter_token_latency_seconds, i.e. ITL, not TPOT. Label the panels as “ITL p99 / p50” (not “TPOT”). TPOT is not exposed directly by vLLM as a histogram; it must be computed from E2E and output_tokens.
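One way to approximate TPOT from the existing counters is to divide the increase of the E2E latency sum by the increase of the generated-token sum between two scrapes. The sketch below assumes the two values come from vllm:e2e_request_latency_seconds_sum and vllm:request_generation_tokens_sum; the dictionary keys and function name are hypothetical. The result is a token-weighted average over the interval, not a per-request percentile, so it is only a rough stand-in for a benchmark-reported TPOT.

```python
def mean_tpot(prev, cur):
    """Token-weighted mean TPOT (seconds/token) between two scrapes.

    prev, cur: dicts with 'e2e_sum' (seconds) and 'gen_tokens_sum' (tokens).
    Returns None when no tokens were completed in the interval.
    """
    d_tokens = cur["gen_tokens_sum"] - prev["gen_tokens_sum"]
    if d_tokens <= 0:
        return None
    return (cur["e2e_sum"] - prev["e2e_sum"]) / d_tokens


# Hypothetical scrape values.
before = {"e2e_sum": 100.0, "gen_tokens_sum": 5000}
after = {"e2e_sum": 160.0, "gen_tokens_sum": 8000}
print(mean_tpot(before, after))  # 60 s of request time over 3000 tokens
```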

TTFT p99:

histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

TTFT p50:

histogram_quantile(0.50, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

ITL p99:

histogram_quantile(0.99, sum by (le) (rate(vllm:inter_token_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

ITL p50:

histogram_quantile(0.50, sum by (le) (rate(vllm:inter_token_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

E2E Latency p99:

histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

E2E Latency p50:

histogram_quantile(0.50, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))
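To read these panels correctly it helps to know how histogram_quantile() derives a value from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it. The following is a simplified sketch of that behavior for classic histograms, not the Prometheus implementation; the bucket bounds and counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), ascending, last bound is +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical TTFT buckets (seconds): 100 observations in total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(q, histogram_quantile(q, buckets))
```

Two consequences matter for the dashboards: the reported percentile is an interpolation whose precision depends on bucket boundaries, and a percentile that falls in the +Inf bucket is reported as the highest finite bound rather than the true value.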

System State Dashboard

Waiting Requests:

vllm:num_requests_waiting{instance="192.168.4.250:8000"}

Running Requests:

vllm:num_requests_running{instance="192.168.4.250:8000"}

Preempted Requests:

vllm:num_requests_preempted{instance="192.168.4.250:8000"}

KV Cache Usage (%):

vllm:kv_cache_usage_perc{instance="192.168.4.250:8000"} * 100

Observed Throughput — Prompt (tokens/sec):

rate(vllm:prompt_tokens_total{instance="192.168.4.250:8000"}[15s])

Observed Throughput — Generation (tokens/sec):

rate(vllm:generation_tokens_total{instance="192.168.4.250:8000"}[15s])

Compute Throughput — Prefill (tokens/sec):

rate(vllm:request_prompt_tokens_sum{instance="192.168.4.250:8000"}[15s]) / rate(vllm:request_prefill_time_seconds_sum{instance="192.168.4.250:8000"}[15s])

Compute Throughput — Decode (tokens/sec):

rate(vllm:request_generation_tokens_sum{instance="192.168.4.250:8000"}[15s]) / rate(vllm:request_decode_time_seconds_sum{instance="192.168.4.250:8000"}[15s])

LMCache Dashboard

LMCache Hit Rate:

sum(rate(vllm:external_prefix_cache_hits_total{instance="192.168.4.250:8000"}[15s])) / sum(rate(vllm:external_prefix_cache_queries_total{instance="192.168.4.250:8000"}[15s]))

LMCache Query Rate:

rate(vllm:external_prefix_cache_queries_total{instance="192.168.4.250:8000"}[15s])

GPU Dashboard (DCGM)

GPU Utilization:

DCGM_FI_DEV_GPU_UTIL{instance=~"${instance}", gpu=~"${gpu}"}

GPU Memory Used:

DCGM_FI_DEV_FB_USED{instance=~"${instance}", gpu=~"${gpu}"}

Note: The current dashboard does not include a panel for DCGM_FI_DEV_GPU_TEMP. Adding the panel is recommended for the pre-benchmark checklist (§9), which requires verifying the temperature baseline.

Node Dashboard (Node Exporter)

CPU Usage (%):

100 * (1 - avg by (instance) (irate(node_cpu_seconds_total{job="gpu-node", mode="idle"}[5m])))

RAM Usage (GB):

(node_memory_MemTotal_bytes{job="gpu-node"} - node_memory_MemAvailable_bytes{job="gpu-node"}) / 1024 / 1024 / 1024

SATA Disk IOPS — Read:

rate(node_disk_reads_completed_total{instance="192.168.4.250:9100", device="sda"}[1m])

SATA Disk IOPS — Write:

rate(node_disk_writes_completed_total{instance="192.168.4.250:9100", device="sda"}[1m])

6. Cross-Verification: Logs vs Charts

The Sanity Check

The textual logs of vLLM are the “real-time truth”. If the log reports TTFT = 140ms but the Grafana chart shows 35s, the problem is the chart’s aggregation window, not the server.

Procedure:

  1. Identify the anomaly in the Grafana chart.
  2. Consult vLLM logs for the same temporal interval.
  3. If logs confirm normal values → echo effect (§3). Wait for window flush.
  4. If logs confirm anomalous values → real problem. Proceed with the diagnostic tree (§4).

Useful Commands

Latest vLLM logs (Docker):

docker logs --tail 100 --timestamps <container_name>

Filter by specific timestamp:

docker logs --timestamps <container_name> 2>&1 | grep "2025-06-15T14:3"

Check recent restarts:

docker inspect --format='{{.RestartCount}} {{.State.StartedAt}}' <container_name>

7. Hardware Operational Thresholds — RTX 4070 Super

Reference values for the GPU node (gpu.dielabs.eu).

Parameter | Value | Alert Threshold | Notes
Total VRAM | 12 GB GDDR6X | | Budget for model + KV cache + CUDA overhead (~300-500 MB).
Memory Bandwidth | 363 GB/s | | Limits decode of memory-bound models (e.g. 7B FP16).
GPU Temp (idle) | ~35-45°C | | Pre-benchmark baseline.
GPU Temp (load) | ~65-80°C | > 83°C | Above 83°C the GPU starts thermal throttling, reducing clock and throughput.
GPU Utilization (active inference) | 60-95% | < 30% with high ITL | Below 30% with high latency → memory-bound, not compute-bound.
KV Cache Usage | Variable | > 95% | At saturation, preemption or request rejection.
Power Limit | 220W (stock) | | Verify with nvidia-smi that it has not been reduced.
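The alert thresholds in the table can be expressed as a simple check, for example in a script that runs before a benchmark. This is a sketch: the function name, argument names and alert strings are assumptions, and "ITL high" must be decided by the caller against its own target.

```python
def hardware_alerts(gpu_temp_c, gpu_util_pct, itl_high, kv_cache_pct):
    """Return the alert thresholds from the table that the current readings cross."""
    alerts = []
    if gpu_temp_c > 83:
        alerts.append("GPU temperature above 83°C: thermal throttling")
    if gpu_util_pct < 30 and itl_high:
        alerts.append("GPU utilization below 30% with high ITL: memory-bound")
    if kv_cache_pct > 95:
        alerts.append("KV cache above 95%: preemption or rejection risk")
    return alerts


print(hardware_alerts(gpu_temp_c=85, gpu_util_pct=25, itl_high=True, kv_cache_pct=97))
print(hardware_alerts(gpu_temp_c=70, gpu_util_pct=80, itl_high=False, kv_cache_pct=60))
```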

8. Incident Response Procedure

Operational workflow when an alert is received or an anomaly is noticed.

Step 1: Check p50 and p99 TTFT + ITL

┌───────────────────────────────────────────────────────────┐
│  p50 OK, p99 high     → Echo effect / outliers            │
│                          Check logs. Wait for flush.      │
├───────────────────────────────────────────────────────────┤
│  TTFT high, ITL low   → Queue saturation                  │
│                          Waiting Requests + KV Cache %.   │
├───────────────────────────────────────────────────────────┤
│  ITL high, TTFT low   → GPU bottleneck                    │
│                          DCGM GPU Util + quantization.    │
├───────────────────────────────────────────────────────────┤
│  Everything high      → Total saturation                  │
│                          Reduce load immediately.         │
└───────────────────────────────────────────────────────────┘

Step 2: Check context metrics: Preempted Requests, LMCache Hit Rate, Observed vs Compute Throughput.

Step 3: Verify vLLM logs for confirmation (§6).

Step 4: Apply action from the diagnostic tree (§4).

Step 5: Monitor recovery — p50 normalizes first, p99 follows after temporal window flush.
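The Step 1 decision box can be sketched as a function. Treating a metric as "high" when its p50 exceeds a limit is an assumption made here to separate sustained degradation from outliers; the limits themselves are hypothetical and depend on the agreed SLOs.

```python
def triage(ttft_p50, ttft_p99, itl_p50, itl_p99, ttft_limit, itl_limit):
    """Map Step 1 readings (seconds) to a row of the decision box."""
    ttft_high = ttft_p50 > ttft_limit
    itl_high = itl_p50 > itl_limit
    if ttft_high and itl_high:
        return "Total saturation: reduce load immediately."
    if ttft_high:
        return "Queue saturation: check Waiting Requests and KV Cache %."
    if itl_high:
        return "GPU bottleneck: check DCGM GPU Util and quantization."
    if ttft_p99 > ttft_limit or itl_p99 > itl_limit:
        return "Echo effect / outliers: check logs, wait for flush."
    return "Healthy: no action."


# Hypothetical limits: TTFT 1 s, ITL 50 ms.
print(triage(0.2, 8.0, 0.02, 0.03, ttft_limit=1.0, itl_limit=0.05))
print(triage(6.0, 9.0, 0.02, 0.03, ttft_limit=1.0, itl_limit=0.05))
```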


9. Pre-Benchmark Checklist

To be executed before every benchmark session to ensure clean and comparable data.