Observability KPI — Monitoring, Diagnostics and Incident Response for LLM Inference Systems

Version 1.3

Stack: vLLM + Prometheus + Grafana + DCGM Exporter + Node Exporter + LMCache

Hardware: NVIDIA RTX 4070 Super (12 GB VRAM, 363 GB/s)

ISM scope: L0 (Hardware) → L4b (GPU Workload Optimization)

1. The Golden Metrics Glossary

Four metrics define the user experience and the health of the inference system.

Metric	Full Name	LLM Phase	Description
TTFT	Time To First Token	Prefill	Reaction time. The engine processes the prompt and generates the first token. Depends on prompt length, GPU availability and — under load — queue time before admission to the batch.
ITL	Inter-Token Latency	Decode	Time between one token and the next during the decode of an in-flight request. Determines the fluidity perceived by the user. Depends on model weight and the density of the current batch. Does not see queue or prefill.
TPOT	Time Per Output Token	Per-request (total)	Average cost per token computed over the total lifetime of the request, defined as `E2E / output_tokens` (GuideLLM convention). Includes queue, prefill and decode. The metric most commonly used in contractual SLAs.
E2E	End-to-End Latency	Total	Total time from HTTP request to the last token. Includes queuing, prefill, decode and network overhead.

TPOT vs ITL — Which Metric for Which Purpose

Both are per-request metrics, but they measure different segments of the request lifetime:

ITL is the interval between consecutive tokens during the decode of a request already in-flight in the batch. It does not include queue or prefill. It is a system metric observed per-request: it tells how fast decode runs when the request is actually executing.
TPOT is the average cost per token computed over the entire lifetime of the request. It includes queue, prefill and decode. It is a user-facing per-request metric: it tells how fast that user perceived the tokens arriving on average.

At empty queue (c=1), TPOT ≈ ITL because TTFT is dominated by prefill compute alone — small and stable. Under saturated load, ITL stays flat because it only measures in-flight decode, while TPOT grows because it includes TTFT, and TTFT under saturation is dominated by queue time of requests waiting for admission to the batch.

Which to use: for operational monitoring, use ITL — it is the native metric exposed by vLLM, and it isolates decode behavior from scheduling and queue noise. For contractual SLAs, use TPOT, because it corresponds to what the end user perceives. The PromQL queries in §5 measure ITL; the Sizing KB covers in detail when to use one or the other during requirements gathering.

TPOT vs E2E — Same Measurement, Different Reading

TPOT and E2E are not independent metrics: they are linked by a fixed relationship.

TPOT = E2E / output_tokens

Both describe the same physical event — the lifetime of a single request — but expressed in different units. E2E reads it as a total duration (seconds), TPOT reads it as a rate (seconds per token). Knowing one plus the output length gives the other automatically.

The reason both exist is that they answer different questions:

E2E — “How long do I wait in total?” Relevant for short outputs and chat-style replies, where the user perceives the response as a single block.
TPOT — “How fast do the tokens arrive on average?” Relevant for long outputs and streaming generation, where the user starts reading before the response is complete and judges the flow rather than the total wait.

The discriminant is output length. For a ~50-token reply, the user perceives “fast or slow” as a single value — E2E is the right reading. For a ~2000-token reply, the user is already reading while generation continues — TPOT becomes the meaningful metric.

SLA implication: an SLA on TPOT and an SLA on E2E are not interchangeable. TPOT commits to a sustainable token rate; E2E commits to a maximum completion time. Under queue saturation, TPOT degrades more gently than E2E for long outputs, because the queue time is spread over many tokens. For mixed workloads (chat + summarization), a dual-SLA formulation is often more appropriate than a single number — for example: “E2E p95 < X for responses ≤ 200 tokens, TPOT p95 < Y for responses > 200 tokens”.

Supporting Metrics

Metric	Description	Why It Matters
Waiting Requests	Requests in queue before entering Running	Early saturation indicator. If it grows, TTFT will follow.
Running Requests	Requests currently in inference on the GPU	Shows the level of active parallelism (continuous batching).
Preempted Requests	Requests preempted by the inference engine	Direct signal of pressure on the KV cache — the engine is forcibly freeing slots.
GPU KV Cache Usage	Percentage of KV cache allocated	At ~95-100%, vLLM starts preemption or rejects requests.
GPU Utilization	GPU compute utilization (DCGM)	Distinguishes whether the GPU is the bottleneck or the problem is elsewhere.
GPU Memory Used	VRAM occupied (DCGM)	Baseline to understand the margin after model loading.
LMCache Hit Rate	Ratio of cache hits to total queries (external prefix cache)	Indicates KV cache offloading effectiveness. Low hit rate = the cache is not serving, investigate prompt patterns.
Observed Throughput	Effective tokens/sec observed (prompt + generation separated)	Real end-to-end throughput, includes scheduling and queuing overhead.
Compute Throughput	Pure compute tokens/sec (prefill and decode separated)	Raw GPU throughput, isolated from overhead. The delta with Observed Throughput reveals the system overhead.

Per-Request vs System Metrics

A distinction that comes back useful during diagnostics:

Per-request metrics — one measurement for each served request. Across a population of requests, distributions (p50, p95, p99) are then computed. They are: TTFT, ITL, TPOT, E2E, output_tokens, input_tokens.
System metrics — one measurement per time interval or per instantaneous state, aggregated across all in-flight requests. They are: throughput in tok/s, RPS, Waiting/Running/Preempted Requests, KV Cache Usage, GPU Utilization.

When an SLO says “TTFT p95 < 1s”, it means that the 95th percentile of the TTFT distribution measured across all requests in the observation period must stay below 1 second.

2. Observed vs Compute Throughput: Two Measures, Two Questions

The system exposes two classes of throughput metrics that answer different questions.

Prefill and Decode: Why Two Measures Are Needed

Prefill (prompt processing) is compute-bound: the GPU executes a parallel forward pass on all input tokens. Decode (output generation) is memory-bandwidth-bound: at each step the GPU reads model weights from VRAM to produce a single token. Monitoring them as an aggregate hides the nature of the bottleneck.

Observed Throughput (System Throughput)

Tokens processed or generated per unit of wall-clock time. Includes everything: GPU compute, scheduling, queue time, preemption/swap, CPU-side tokenization, PCIe transfers, idle time between batch iterations.

Answers the question: how many tokens/sec is the system producing right now?

Compute Throughput (Engine Throughput)

Tokens processed or generated per unit of effective GPU compute time. Excludes any orchestration overhead.

Answers the question: when the GPU is working, how fast does it work?

Efficiency Gap

Under ideal conditions (no queue, no overhead), Observed ≈ Compute. Under load they diverge:

Efficiency Gap = 1 - (Observed Throughput / Compute Throughput)

Gap ≈ 0% → efficient system, almost all time is useful compute.
Gap 20-40% → significant orchestration overhead.

Sources of the Overhead (the Delta)

The delta between Observed and Compute captures the sum of:

Scheduling delay — the continuous batching scheduler decides which requests to include; the GPU waits.
Queue time — requests in waiting do not produce tokens, but time keeps flowing.
Preemption and KV cache swap — manifests only under VRAM pressure or with KV offload active (e.g. LMCache local_cpu: true). The scheduler interrupts a request, copies the KV state via PCIe (swap out/in). Dead time for compute.
KV cache block management — PagedAttention block allocation/deallocation (CPU-side).
Tokenization/Detokenization — text↔token ID conversion (CPU-side).
PCIe host↔device transfers — input token IDs to GPU, logits/output to CPU.
Batch padding and tensor reorganization — requests with different lengths in continuous batching.

On a single GPU node with controlled benchmark loads, the gap is typically negligible. In production with thousands of concurrent requests and frequent swap, it can become significant.

Diagnostics: TTFT × Dual Throughput

TTFT	Compute Throughput	Diagnosis
Rises	Stable	Orchestration problem (queue, scheduling). Check Waiting Requests and Efficiency Gap.
Rises	Drops	GPU problem (compute saturation). Check `DCGM_FI_DEV_GPU_UTIL`.
Stable	Stable	Healthy system.
Stable	Drops	Anomaly — possible thermal throttling or GPU error. Check temperature and DCGM logs.

Operational Note

Compute Throughput metrics produce values only when there are active requests. At rest, rate() returns 0 or NaN (division by zero when there is no compute time). This is expected behavior, not an error.

3. Applied Statistics: Percentiles in Inference Systems

Never look only at the average: percentiles tell different stories.

p50 (Median)

The central value. Tells the story of the “typical” user. It is the first indicator to normalize when a problem ends. If p50 is good, most users are satisfied.

p99 (Edge Case)

The worst 1% of requests. Reveals bottlenecks, queues and saturation problems that the median hides. It is the KPI that matters for production SLAs.

The Echo Effect

p99 stays high in graphs even after a problem ends because slow data points remain in the temporal aggregation window until they exit the calculation. If the window is 5 minutes, it takes 5 minutes for p99 to normalize.

How to distinguish it from a real problem: if p50 is already back to normal but p99 stays high, it is almost certainly an echo effect. Wait for the window to drain.

Counter Reset vs Echo Effect

A vertical drop in the p99 line can mean two things:

Statistical flush: slow data points have exited the window — the system is back to healthy.
vLLM restart: Prometheus counters were zeroed by a process restart.

Verification query:

resets(vllm:time_to_first_token_seconds_count[5m])

If the result is > 0, there has been a restart. Annotate it to avoid confusing it with an organic recovery.

4. Diagnostic Tree: From Metric to Root Cause

Operational table for incident response. Cross-reference symptoms to identify the cause.

Symptom	Probable Cause	Layer	Immediate Action
High ITL, low TTFT	Slow GPU on decode. Model too heavy or insufficient quantization.	L1	Check GPU utilization (DCGM). Consider more aggressive quantization (FP16 → AWQ/GPTQ 4-bit). Lower max_model_len.
Very high TTFT, low ITL	Queue saturation. Requests wait in Waiting too long.	L3	Check waiting_requests. Reduce concurrent load. Check KV cache usage — if at 100%, prefill is blocked.
High p99, low p50	Healthy system, but outliers (very long prompts, past queue) pollute statistics.	—	Check if echo effect (§3). Check prompt length distribution.
High TTFT, high ITL	Total saturation. GPU at max, queue full, no margin.	L1+L3	Reduce load immediately. Verify whether the model is appropriate for the hardware.
KV Cache > 95%	Risk of preemption or request rejection. Insufficient memory.	L3	Lower max_model_len, reduce concurrency, evaluate smaller model or quantization. Check `vllm:num_requests_preempted` to confirm preemption is active.
GPU Util < 30% with high ITL	Memory bandwidth bottleneck (memory-bound).	L1	RTX 4070 Super has 363 GB/s — for a 7B FP16 (~14 GB) decode is memory-bound. Quantizing reduces the data volume to read.
Preempted > 0 growing	KV cache under active pressure. The engine is evicting requests.	L3	Check KV cache usage. Reduce concurrency or max_model_len. If preemption policy is recompute, TTFT of evicted requests will rise.
Low LMCache Hit Rate (< 0.3) with high KV Cache	The external cache is not serving. Prompts have insufficient prefix overlap.	L3	Check prompt patterns — if very different from each other, prefix cache is ineffective. Evaluate whether LMCache is configured correctly.
Observed Throughput « Compute Throughput	High system overhead (scheduling, queuing, preemption).	L3+L4a	The GPU has free capacity but the system is not using it. Check waiting requests, preemption, continuous batching configuration.

Note on TPOT vs ITL diagnostics: if an external benchmark report (e.g. GuideLLM) shows TPOT rising under load while ITL in the Prometheus dashboards stays stable, this is not an inconsistency: ITL measures in-flight decode, TPOT includes the queue time accumulated in TTFT. The divergence is the expected symptom of queue saturation — use the tree above starting from TTFT and Waiting Requests.

5. Operational PromQL Queries for Grafana

Configuration Principles

The current dashboard uses the following conventions:

Time interval: [15s] for maximum reactivity.
Function: rate on all queries.
Technical note: With a 15s scrape interval, a [15s] window typically contains 1 sample (at the edge). Prometheus requires at least 2 data points to calculate a rate. In practice it works because Prometheus can use the last sample preceding the window, but the behavior is fragile: if a scrape fails, the rate returns NaN. For production environments, evaluate [30s] as a compromise between reactivity and robustness. Annotate the scrape interval in the dashboard.
Instance: All vLLM queries filter by instance="192.168.4.250:8000".

Golden Metrics Dashboard

Important on panel labels: the native metric exposed by vLLM is inter_token_latency_seconds, i.e. ITL — not TPOT. Label the panels as “ITL p99 / p50” (not “TPOT”). To measure TPOT, it must be computed from E2E and output_tokens, and it is not exposed directly by vLLM as a histogram.

TTFT p99:

histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

TTFT p50:

histogram_quantile(0.50, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

ITL p99:

histogram_quantile(0.99, sum by (le) (rate(vllm:inter_token_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

ITL p50:

histogram_quantile(0.50, sum by (le) (rate(vllm:inter_token_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

E2E Latency p99:

histogram_quantile(0.99, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

E2E Latency p50:

histogram_quantile(0.50, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket{instance="192.168.4.250:8000"}[15s])))

System State Dashboard

Waiting Requests:

vllm:num_requests_waiting{instance="192.168.4.250:8000"}

Running Requests:

vllm:num_requests_running{instance="192.168.4.250:8000"}

Preempted Requests:

vllm:num_requests_preempted{instance="192.168.4.250:8000"}

KV Cache Usage (%):

vllm:kv_cache_usage_perc{instance="192.168.4.250:8000"} * 100

Observed Throughput — Prompt (tokens/sec):

rate(vllm:prompt_tokens_total{instance="192.168.4.250:8000"}[15s])

Observed Throughput — Generation (tokens/sec):

rate(vllm:generation_tokens_total{instance="192.168.4.250:8000"}[15s])

Compute Throughput — Prefill (tokens/sec):

rate(vllm:request_prompt_tokens_sum{instance="192.168.4.250:8000"}[15s]) / rate(vllm:request_prefill_time_seconds_sum{instance="192.168.4.250:8000"}[15s])

Compute Throughput — Decode (tokens/sec):

rate(vllm:request_generation_tokens_sum{instance="192.168.4.250:8000"}[15s]) / rate(vllm:request_decode_time_seconds_sum{instance="192.168.4.250:8000"}[15s])

LMCache Dashboard

LMCache Hit Rate:

sum(rate(vllm:external_prefix_cache_hits_total{instance="192.168.4.250:8000"}[15s])) / sum(rate(vllm:external_prefix_cache_queries_total{instance="192.168.4.250:8000"}[15s]))

LMCache Query Rate:

rate(vllm:external_prefix_cache_queries_total{instance="192.168.4.250:8000"}[15s])

GPU Dashboard (DCGM)

GPU Utilization:

DCGM_FI_DEV_GPU_UTIL{instance=~"${instance}", gpu=~"${gpu}"}

GPU Memory Used:

DCGM_FI_DEV_FB_USED{instance=~"${instance}", gpu=~"${gpu}"}

Note: The current dashboard does not include a panel for DCGM_FI_DEV_GPU_TEMP. Adding the panel is recommended for the pre-benchmark checklist (§9), which requires verifying the temperature baseline.

Node Dashboard (Node Exporter)

CPU Usage (%):

100 * (1 - avg by (instance) (irate(node_cpu_seconds_total{job="gpu-node", mode="idle"}[5m])))

RAM Usage (GB):

(node_memory_MemTotal_bytes{job="gpu-node"} - node_memory_MemAvailable_bytes{job="gpu-node"}) / 1024 / 1024 / 1024

SATA Disk IOPS — Read:

rate(node_disk_reads_completed_total{instance="192.168.4.250:9100", device="sda"}[1m])

SATA Disk IOPS — Write:

rate(node_disk_writes_completed_total{instance="192.168.4.250:9100", device="sda"}[1m])

6. Cross-Verification: Logs vs Charts

The Sanity Check

The textual logs of vLLM are the “real-time truth”. If the log reports TTFT = 140ms but the Grafana chart shows 35s, the problem is the chart’s aggregation window, not the server.

Procedure:

Identify the anomaly in the Grafana chart.
Consult vLLM logs for the same temporal interval.
If logs confirm normal values → echo effect (§3). Wait for window flush.
If logs confirm anomalous values → real problem. Proceed with the diagnostic tree (§4).

Useful Commands

Latest vLLM logs (Docker):

docker logs --tail 100 --timestamps <container_name>

Filter by specific timestamp:

docker logs <container_name> 2>&1 | grep "2025-06-15T14:3"

Check recent restarts:

docker inspect --format='' <container_name>

7. Hardware Operational Thresholds — RTX 4070 Super

Reference values for the GPU node (gpu.dielabs.eu).

Parameter	Value	Alert Threshold	Notes
Total VRAM	12 GB GDDR6X	—	Budget for model + KV cache + CUDA overhead (~300-500 MB).
Memory Bandwidth	363 GB/s	—	Limits decode of memory-bound models (e.g. 7B FP16).
GPU Temp (idle)	~35-45°C	—	Pre-benchmark baseline.
GPU Temp (load)	~65-80°C	> 83°C	Above 83°C the GPU starts thermal throttling, reducing clock and throughput.
GPU Utilization (active inference)	60-95%	< 30% with high ITL	Below 30% with high latency → memory-bound, not compute-bound.
KV Cache Usage	Variable	> 95%	At saturation, preemption or request rejection.
Power Limit	220W (stock)	—	Verify with `nvidia-smi` that it has not been reduced.

8. Incident Response Procedure

Operational workflow when an alert is received or an anomaly is noticed.

Step 1: Check p50 and p99 TTFT + ITL

┌───────────────────────────────────────────────────────────┐
│  p50 OK, p99 high     → Echo effect / outliers            │
│                          Check logs. Wait for flush.      │
├───────────────────────────────────────────────────────────┤
│  TTFT high, ITL low   → Queue saturation                  │
│                          Waiting Requests + KV Cache %.   │
├───────────────────────────────────────────────────────────┤
│  ITL high, TTFT low   → GPU bottleneck                    │
│                          DCGM GPU Util + quantization.    │
├───────────────────────────────────────────────────────────┤
│  Everything high      → Total saturation                  │
│                          Reduce load immediately.         │
└───────────────────────────────────────────────────────────┘

Step 2: Check context metrics: Preempted Requests, LMCache Hit Rate, Observed vs Compute Throughput.

Step 3: Verify vLLM logs for confirmation (§6).

Step 4: Apply action from the diagnostic tree (§4).

Step 5: Monitor recovery — p50 normalizes first, p99 follows after temporal window flush.

9. Pre-Benchmark Checklist

To be executed before every benchmark session to ensure clean and comparable data.

vLLM just started (Prometheus counters zeroed)
No other GPU-intensive process active (nvidia-smi)
Prometheus scrape interval verified and annotated (current: 15s)
Grafana dashboard with [15s] windows + rate
Stable GPU temperature baseline — verify DCGM_FI_DEV_GPU_TEMP (no thermal throttling, see §7)
max_model_len and vLLM parameters documented
Model quantization documented
Warm-up completed (2-3 warm-up requests discarded; criterion: ITL stabilized and GPU temp on plateau)