KV Cache Manual

Version 1.1 — April 2026
Author: Diego @ Dielabs
Stack reference: Dielabs Inference Stack Model (L0–L6)

Why the KV cache is the real bottleneck
Taxonomy: session, turn, request, prefill, decode
The three layers of parameters
Anatomy of the KV cache
Request scope: the default and why it exists
PagedAttention and pool management in vLLM
Behavior under load
Prefix caching
Tuning parameters and operational trade-offs
Policy levers: TTL, quotas, eviction
Management profiles compared
KV cache offload and storage hierarchy
Software ecosystem 2026
The economic cost of long chats
Observability metrics
Scale: from single-node to distributed inference
Relevance for Dielabs and roadmap

1. Why the KV cache is the real bottleneck

In an LLM inference system the temptation is to attribute performance limits to the model weights — how big the model is, how much VRAM it takes. This is a wrong simplification. Model weights are fixed: they take space once and don’t grow. The resource that determines effective concurrency, perceived latency and cost per token is the KV cache.

The KV cache is the structure that lets a transformer model generate tokens efficiently without recomputing the full prior history at every step. It lives in GPU VRAM — the fastest and scarcest space in the system — and it grows with context length and with the number of concurrent requests. When it runs out, the system enters degraded regimes: queueing, preemption, recomputation.

This manual starts from the foundational concepts and arrives at the 2026 software ecosystem (vLLM, LMCache, NIXL, Dynamo, CMX). It deliberately stays neutral with respect to hardware and workload: Dielabs-specific cases are collected in the final section.

2. Taxonomy: session, turn, request, prefill, decode

To reason correctly about the KV cache you have to keep two viewpoints distinct: the application side (what the user sees) and the inference engine side (what vLLM or an equivalent engine sees).

Application side

Level	Description
Session / Conversation	Contains N turns. It’s an application-level concept. The engine has no visibility into the application session.
Turn	A back-and-forth: user message + model response. Each turn generates a request.

Inference engine side

Level	Description
Request	A single HTTP call to `/v1/chat/completions`. The atomic unit for the engine.
Prompt	The content of the request that the model “reads”.
Prefill	The phase in which the engine processes the entire prompt and builds the KV cache.
Decode	The phase in which the engine generates response tokens one by one, using the KV cache.

The envelope paradox

A session contains turns over time, but every request contains the entire session as data. At turn 20, the app re-sends the previous 19 turns inside the prompt of a single request. The engine only sees that request — it does not know it’s the twentieth turn. This distinction is the key to understanding why the KV cache, by default, does not persist across successive turns: for the engine there are no turns, only independent requests.

3. The three layers of parameters

An HTTP request to an inference engine is not governed only by the parameters passed in the body. Three distinct layers contribute to the final behavior.

Layer	Scope	When decided	Examples
Model params	Fixed in the model	At training / model selection	Number of layers, hidden state size, attention heads, vocabulary size
Engine params	Per engine instance	At vLLM launch	`max_model_len`, `gpu_memory_utilization`, `tensor_parallel_size`, `enable_prefix_caching`, `kv_cache_dtype`
Request params	Per single request	At every HTTP call	Prompt, `temperature`, `top_p`, `max_tokens`, `stop`

Model params are not configurable at runtime, but they determine the structural cost of everything else. Engine params define the resource budget. Request params operate inside that budget.

What consumes tokens in the prompt

System prompt, history of previous turns re-sent by the app, current user message, tool/function definitions, RAG context injected by the app, few-shot examples. All of these enter the prefill and occupy KV cache.

What does not consume tokens

Sampling parameters (temperature, top_p, top_k, max_tokens, stop, frequency_penalty, presence_penalty) only influence decode, not prefill. They do not occupy KV cache.

4. Anatomy of the KV cache

During autoregressive generation, for every token the model computes two vectors: K (Key) and V (Value). Without a KV cache the model would have to recompute attention over the entire sequence at every step. The KV cache avoids this recomputation, drastically reducing the incremental cost per token.

The impact is concrete: lower TTFT on long prompts, more stable inter-token latency during generation.

Per-token consumption formula

KV_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes

The dtype_bytes factor depends on the precision configured for the KV cache (see kv_cache_dtype in §9), which is independent from the model weights’ quantization:

dtype	bytes	Notes
FP32	4	Rarely used in serving
BF16 / FP16	2	Default in most engines
FP8	1	Halves KV consumption — requires Ada Lovelace+
INT8	1	Engine-dependent support

For a 7B model with GQA in float16: 2 × 32 × 8 × 128 × 2 = ~128 KB per token. With kv_cache_dtype=fp8 the same model drops to ~64 KB per token.

Consumption examples at realistic context lengths

At FP16:

Model	1K tokens	10K tokens	100K tokens	Notes
Qwen3-8B	0.15 GB	1.47 GB	14.75 GB	Dense
LLaMA 3.3 70B	0.33 GB	3.28 GB	32.77 GB	Dense
LLaMA 3.1 405B	0.52 GB	5.16 GB	51.6 GB	Dense
DeepSeek-R1 685B	0.07 GB	0.7 GB	7 GB	MoE

Box — KV cache in MoE models. In MoE (Mixture of Experts) models, the per-token cost depends on the architecture actually active. In many cases the KV footprint can be smaller than equivalent dense models, but it has to be verified model by model.

These numbers explain why the KV cache — not the model weights — is the real bottleneck on concurrency. At 8192 tokens of context a 7B model already takes ~1 GB of KV per single active sequence.

5. Request scope: the default and why it exists

Without optimizations at L4a/L4b in the Dielabs Inference Stack Model, the KV cache is strictly request-scoped:

Born at prefill
Serves the decode (avoiding recomputation of attention over all prior tokens at every new token generated)
Dies at the end of the response

It does not persist across successive requests, not even within the same active session. From the inference engine’s point of view there is no difference between an “active session” and a “historical session”: both are new requests with full prefill.

This is the default — and it’s a coherent design choice. Keeping the KV cache of millions of inactive sessions in VRAM is economically unsustainable. Everything we will see in sections 8, 12, 13 (prefix caching, offload, LMCache, Dynamo) is a layer on top of this default to selectively extend the cache lifetime where it pays off.

6. PagedAttention and pool management in vLLM

In a naive implementation, the KV cache is a contiguous block of memory pre-allocated per request. This causes high fragmentation and limits concurrency. PagedAttention is the innovation that defined vLLM: the KV cache is not allocated as a monolithic block but in pages (blocks) managed dynamically, in analogy with the virtual memory of an operating system.

What occupies VRAM at boot

At startup, VRAM is split into three components:

Model weights — fixed, determined by size and quantization (~5–6 GB for a 7–8B model in AWQ 4-bit)
KV cache pool — fixed after boot, but the internal distribution changes dynamically
System overhead — CUDA driver, PyTorch, GPU processes: about 0.5–1 GB

System overhead is the reason gpu_memory_utilization is never set to 1.0.

The two phases of management

Phase 1 — Pool reservation (at boot). vLLM pre-allocates a fixed pool of KV blocks. The pool is fixed at boot and does not vary during the lifetime of the process.

KV Cache Pool = (total_VRAM × gpu_memory_utilization) − model_weights_VRAM

If the model is too large or gpu_memory_utilization is too high, vLLM fails at boot — not at runtime. If it starts, the system is stable.

Phase 2 — Dynamic allocation (at runtime). Within the fixed pool, PagedAttention assigns blocks dynamically:

Every active sequence receives blocks on demand, not in advance
The sequence grows block by block (typically 16 tokens at a time)
When it terminates, the blocks are freed and reassigned

vLLM does not reserve space for max_model_len per request. If a request generates 200 tokens instead of 8192, it occupies ~13 blocks instead of 512.

Analogy. The pool is a parking lot with N spots. Every car (request) does not reserve N spots at the entrance: it occupies one at a time as it advances. If the cars are small, far more than N can fit in the lot.

Concrete effects of PagedAttention

Lower fragmentation → GPU memory is not wasted in oversized blocks
Higher concurrency → more requests can coexist on the same GPU
Technical foundation for prefix reuse → identical KV pages can be shared across requests

Worst-case concurrency in the logs

The value shown at startup (e.g. “Maximum concurrency for 8,192 tokens: 2.41x”) is the worst case: how many requests would coexist if each used the full context window. In practice, with short prompts and limited output, effective concurrency is much higher.

7. Behavior under load

The scheduler of an inference engine decides at every step which requests to admit and which to hold back. KV cache saturation is only one of the constraints.

Why requests wait

A request can stay in the queue even with available blocks. The scheduler evaluates multiple constraints:

Constraint	What it controls
KV pool saturation	Available physical blocks
`max_num_seqs`	Concurrent sequence limit in the batch
`max_num_batched_tokens`	Token budget per iteration (especially prefill)
Chunked prefill	Latency protection for sequences in decode

In a well-sized configuration, waiting is typically caused by batch and token-budget limits, not by memory. Waiting is part of the admission control mechanism: the client simply waits.

Preemption: the last resort

When sequences in the batch grow faster than they can be released, waiting is not enough. The scheduler resorts to preemption: it suspends active requests, frees (or swaps to CPU) their KV blocks, and resumes them when possible.

Preemption is expensive (it requires recomputation or reload) and is the signal that the configuration is undersized for the load.

Intervention hierarchy

Admission control (waiting)        ← normal behavior
        ↓ insufficient
KV cache pool saturation (waiting) ← memory constraint
        ↓ insufficient
Preemption (swap/recompute)        ← last resort, to be avoided

Operational rule: in a well-configured system, num_preemptions_total stays at zero.

8. Prefix caching

When two requests share the same prefix, the engine can reuse already-computed KV blocks instead of recomputing them. This is the first optimization that partially breaks request scope.

Content-based, not session-aware

Prefix caching is not tied to the session. It does not know who you are, it does not know you are continuing a chat. It performs an exact token-level match of the prefix of the current request against prefixes already in cache, typically via content hashing (SHA-256).

Implications:

If two different users send the same system prompt, the prefix is shareable
If you change a single comma in the system prompt compared to the previous request, the match misses
Reuse is purely content-based: same bytes/tokens → hit, any token-level difference invalidates the match

Opportunistic lifetime

KV pages in cache do not have a configured TTL. Their lifetime is opportunistic, governed by:

GPU memory pressure (more concurrent requests → more eviction)
Eviction policy (LRU-like in vLLM)
Model size and gpu_memory_utilization

Under low load, pages can persist for minutes. Under intense load they are evicted in seconds. There is no guarantee of reuse.

Impact on benchmarking

Prefix caching distorts TTFT measurements: if you use the same prompt on multiple runs, the second run has an artificially low TTFT.

Strategy: vary the prompt across runs, or restart the container between one model and the next to reset the pool. Monitor prefix_cache_hits_total / prefix_cache_queries_total — ideally below 5% during benchmarks.

9. Tuning parameters and operational trade-offs

The two most relevant parameters for tuning — max_num_seqs and max_model_len — protect different aspects of the system:

Parameter	Mainly protects	Sacrifices
`max_num_seqs` ↓	Latency (per-request ITL) — fewer sequences in the batch → less GPU contention	Aggregate throughput
`max_model_len` ↓	Pool / sustainable throughput — each sequence can grow less → more sequences in the pool	Long-context capacity

In practice, max_num_seqs is the main operational lever: you use it to decide whether you are optimizing for latency (low value) or throughput (high value). max_model_len is set once based on the workload profile and left fixed.

gpu_memory_utilization

Percentage of VRAM used as budget (weights + KV pool). The most important parameter. It only reduces the available KV budget — it does not free VRAM from the weights. Setting it too high causes OOM at boot.

--gpu-memory-utilization 0.85   # recommended on 12 GB consumer

kv_cache_dtype

Controls the precision of K/V vectors in the pool, independently from the model weights’ quantization. With fp8 the per-token KV consumption is halved, effectively doubling the pool capacity at the same VRAM.

--kv-cache-dtype auto   # default: follows model dtype
--kv-cache-dtype fp8    # halves KV consumption — requires Ada Lovelace+ architecture

The impact on generation quality is generally negligible, but it has to be validated for the specific model.

max_model_len

Maximum context window per request (input + output). Reducing it does not change the pool size, but it increases theoretical concurrency because every request can grow less.

Value	Trade-off
8192	Supports long conversations, low worst-case concurrency
4096	Good compromise for latency/throughput benchmarking
2048	Maximizes concurrency, but truncates long contexts

max_num_seqs

Hard cap on simultaneously active sequences. Even if the pool would have room for more requests, serving too many in parallel degrades the ITL of each.

--max-num-seqs 256   # vLLM default (not recommended for benchmarks)
--max-num-seqs 64    # controlled benchmark
--max-num-seqs 8     # single user, optimal latency

max_num_batched_tokens and chunked prefill

Limits the tokens processable in a single step. Chunked prefill (active by default since vLLM 0.15.x) splits long prompts into chunks to prevent a single prompt from blocking all other requests.

Upstream guardrails

Engine parameters operate inside the backend. A complete strategy includes upstream limits: rate limiting (nginx, Traefik, LiteLLM), reject vs queue policy, protection against infinite chats (cap max_tokens per response, truncate history client-side, monitor request_prompt_tokens).

10. Policy levers: TTL, quotas, eviction

So far we’ve discussed engine behavior. Above the engine there is a policy dimension that decides how the KV cache is managed with respect to users, not with respect to the GPU. These policies are not handled by the inference engine but by upper layers — gateway, orchestrator or cache manager (typically L4a/L4b in the Dielabs Inference Stack Model).

How long to keep the cache (TTL). Low TTL means more repeated prefills but free VRAM and a stable system. High TTL means fewer prefills and better latency for the user, but VRAM occupancy and the risk of forced eviction.

How much to keep per user (quotas). Low quotas guarantee fairness and predictability. High quotas improve the experience for intensive users but risk a few users consuming all available VRAM.

What to keep. Prefix caching — keeping only the cache of static prefixes such as the system prompt — is almost always advantageous: low cost, guaranteed reuse. Full session caching, keeping the entire conversation, only pays off if the session is active and continuous.

Where to keep it. VRAM is optimal but scarce. System RAM offers more capacity but introduces transfer penalties to the GPU. Disk is viable only where latency is not critical. This hierarchy is formalized in §12.

Eviction policy. When VRAM saturates, you have to decide what to evict. LRU and TTL are simple and robust. More sophisticated policies keep in memory those who are actively generating or those with high probability of returning soon. In enterprise systems you can define service classes — critical users with higher TTL and lower eviction priority.

11. Management profiles compared

There is no universal configuration. Context determines everything. Three representative profiles:

Hyperscale — providers with millions of simultaneous users. Stateless by default, opportunistic prefix caching, maximum batching. The cache is freed after every response and rebuilt on the next turn. You pay in compute, you gain in scalability.

On-prem enterprise — a hundred employees, dedicated server, predictable load. Stateful with a window: TTL of 10–20 minutes, per-user quotas, LRU eviction, sticky routing to send each user to the same node where their cache is already present.

UX-first — few users, latency is the priority. Long TTL (60+ minutes), high quotas, reduced concurrency. You maximize the quality of the individual experience at the expense of hardware density.

Recommended benchmark

To find the optimal point in your own context, define the three profiles and measure them:

Profile 1 — Hyperscale-like: zero TTL, only prefix caching active
Profile 2 — Enterprise balanced: TTL 10–20 minutes, per-user quota, LRU eviction
Profile 3 — UX-first: TTL 60 minutes, higher quotas, reduced concurrency

Four KPIs: TTFT, tokens/s, peak VRAM utilization, cache hit-rate. Two hours of benchmarking yield a real curve of the trade-off.

12. KV cache offload and storage hierarchy

The key insight is simple: instead of evicting the KV cache and losing it, you move (offload) it to more capacious storage. When the session resumes, the system reads the KV cache from storage instead of recomputing it.

The necessary condition for this to make sense:

Storage read time < Prefill recomputation time

For long contexts this equation tends to be favorable — recomputation is too expensive. For short contexts the advantage is marginal.

Tier hierarchy (NVIDIA nomenclature)

Tier	Layer	Speed	Capacity	Use
G1	GPU VRAM (HBM)	~2 TB/s	12–192 GB	Active sessions
G2	CPU RAM (DRAM)	~100 GB/s	128–512 GB	Recently idle sessions
G3	Local NVMe SSD	~10–50 GB/s	TB-scale	Intermediate cache, pre-staging
G3.5	CMX (ICMS)	~10–50 GB/s	TB-scale	Pod-level dedicated tier
G4	Network storage (RDMA, S3)	~10–40 GB/s	Petabyte	Archived sessions, shared cache

The G3.5 tier (CMX) was introduced by NVIDIA at CES 2026 as a dedicated layer: NVMe flash managed by BlueField-4 DPU, optimized specifically for ephemeral KV cache within an inference pod.

Lifecycle with offload

With an active management layer, the KV cache follows a configurable lifecycle instead of being evicted and lost:

Active conversation → KV in VRAM (G1)
Idle conversation → offload to CPU RAM (G2), VRAM freed
Long-paused conversation → migration to storage (G3/G4), CPU RAM freed
Resumed conversation → read from storage and reload to VRAM, no prefill recomputation
Expired retention → permanent deletion

13. Software ecosystem 2026

The ecosystem for KV cache management has expanded significantly compared to the simple vLLM stack. Four main components, each on a different layer of the Dielabs Inference Stack.

Inference Engines (L3)

vLLM — open source inference engine, manages the KV cache in VRAM via PagedAttention. Default behavior: aggressive eviction. Includes integration with LMCache via the KV connector interface.

SGLang — alternative to vLLM, supports the same offload mechanisms. To be considered for specific scenarios (RadixAttention for efficient prefix sharing).

KV Cache Management — LMCache (L4b)

A management layer that sits on top of inference engines. It adds configurable retention policies, RAM/storage hierarchy management, and the mechanism to recover the KV cache instead of evicting it. Open source, part of the PyTorch Foundation ecosystem.

Key features:

Native integration in vLLM via the kv-connector interface
Multimodal support (caching of vision tokens via mm_hashes hashing)
S3 backend via AWS CRT
Disaggregated prefill/decode with NIXL
Content-addressed storage (hash-based lookup, no centralized cache map)

Data Transport — NIXL (L4b)

NVIDIA Inference Transfer Library. It is not limited to RDMA: it provides a unified vendor-agnostic API that abstracts transfers between GPU, CPU and storage on heterogeneous backends (RDMA, GPUDirect Storage, NVMe-oF, object storage). Supported by AWS, Azure and Google Cloud. Includes dedicated benchmarking tools (NIXLBench for raw metrics, KVBench for LLM-specific profiling).

Orchestration — NVIDIA Dynamo (L4a)

Open source distributed inference serving framework, version 1.0 at GTC 2026. It positions itself as an orchestration layer on top of inference engines — it does not replace them, it coordinates them in a multi-node system.

Key features:

Disaggregated serving: prefill/decode separation on distinct GPUs
KV-aware routing: routes requests toward GPUs that already have the KV cache
KV Block Manager (KVBM): memory management across the tier hierarchy
Dynamic GPU scheduling: resource allocation based on real-time demand
Multimodal: disaggregated encode/prefill/decode with embedding cache

Compatible with vLLM, SGLang, TensorRT-LLM. Adopted by AWS, Azure, GCP, OCI, Perplexity, PayPal, Pinterest, ByteDance.

Infrastructure — NVIDIA CMX (L0–L1)

AI-native storage platform announced at CES 2026. It introduces a dedicated tier (G3.5) for ephemeral pod-level KV cache, based on NVMe flash managed by BlueField-4 DPU (availability planned for H2 2026), DOCA Memos as the SDK for management, Spectrum-X as the RDMA-accelerated fabric, and STX as the modular reference architecture. NVIDIA-claimed results: up to 5x throughput and 5x power efficiency vs traditional storage.

Relationship between components

NVIDIA Dynamo (orchestration — routing, scheduling, disaggregation)
  ├── vLLM / SGLang / TensorRT-LLM (inference engine — generates tokens)
  ├── LMCache (KV cache management — retention, offload, sharing)
  ├── NIXL (data transport — moves KV cache between GPU, CPU, storage)
  └── CMX / BlueField-4 (storage infrastructure — dedicated G3.5 tier)

Mapping to Dielabs layers:

L0 (Hardware): GPU, CPU, NVMe SSD, BlueField-4 DPU, Spectrum-X
L1 (Driver/GPU Runtime): CUDA, DOCA, GPUDirect
L3 (Inference Backend): vLLM, SGLang, TensorRT-LLM
L4a (Request Orchestration): Dynamo
L4b (GPU Workload Optimization): LMCache, NIXL

14. The economic cost of long chats

Prefill has computational cost approximately O(n) with respect to prompt length (where n = number of input tokens). Since at every turn the app re-sends the entire history, the prefill cost grows linearly with the number of turns.

Turn	Tokens in history (example)	Relative prefill cost
1	~200 (system + message)	1×
5	~2,000	10×
20	~10,000	50×
50	~25,000	125×

Without prefix caching, every turn pays the full cost. Direct implications:

Token economics: per-turn cost is not constant. Providers bill input tokens → per-turn cost grows linearly, but cumulatively in a significant way over multi-turn conversations
Agent design: multi-step agents with many turns accumulate significant prefill costs. Context window design (what to keep, what to discard, what to summarize) becomes an economic choice, not just a technical one
Window management: strategies like sliding window, summarization of old turns, or history truncation exist precisely to contain this growth

The intra-request KV cache masks the problem during decode, but does not eliminate it: prefill is the structural cost of long conversations. The optimizations of §8 (prefix caching) and §12–13 (offload, LMCache) are ultimately attempts to break this linear growth.

15. Observability metrics

The Prometheus metrics exposed by vLLM (and by compatible engines) are the fundamental observation point.

Metric	Type	What to observe
`kv_cache_usage_perc`	Gauge	> 80% sustained → saturation risk
`num_requests_running`	Gauge	Compare with configured `max_num_seqs`
`num_requests_waiting`	Gauge	> 0 sustained → system under pressure (verify with `kv_cache_usage_perc` to distinguish from normal admission control)
`num_preemptions_total`	Counter	> 0 → configuration to be revisited
`prefix_cache_hits_total` / `prefix_cache_queries_total`	Counter	Hit rate — during benchmark ideally < 5%
`time_to_first_token_seconds`	Histogram	Monitor p50 and p99; large gap = variance from concurrency or prefix caching
`inter_token_latency_seconds`	Histogram	Rises with concurrency: more requests → more crowded decode batch
`e2e_request_latency_seconds`	Histogram	Sum of queue time + prefill + decode

Patterns to recognize:

High kv_cache_usage_perc + num_requests_waiting > 0 + stable num_preemptions_total → system in controlled saturation, probably to relax with max_num_seqs or max_model_len
num_preemptions_total growing → undersizing, urgent intervention
time_to_first_token_seconds p99 much greater than p50 → high variance, probably noisy prefix cache hits/misses or full queues

16. Scale: from single-node to distributed inference

The mechanism is identical at all scales. What changes is absolute scale and the additional optimizations available.

Aspect	Consumer single-node	Datacenter single-node	Datacenter multi-node
Reference GPU	RTX 4070/4090 (12–24 GB)	H100 (80 GB)	H100/B200 cluster
KV Cache pool	~5 GB	~50–60 GB	TB-scale aggregate
Worst-case concurrency (8K tok)	2–3 requests	50–100+	Thousands
Tensor Parallelism	Not applicable	Optional	Standard
KV cache offload	Not necessary	Optional	Critical
Disaggregated prefill/decode	Not applicable	Not applicable	Standard with Dynamo
Orchestration layer	Direct	Optional	Required

The single-node ceiling is physical, tied to available VRAM: VRAM out, concurrency out. You don’t go past it without adding GPUs. The transition from single-node to multi-node is not an incremental upgrade: it introduces an orchestration layer (Dynamo or equivalent) that changes the mental model of request management.

17. Relevance for Dielabs and roadmap

The current lab setup (single-node vLLM, consumer GPU, mostly single-user traffic) does not yet justify LMCache, Dynamo, or KV cache offload. The immediate eviction observed in Grafana is the correct behavior for this context — there isn’t enough concurrency to justify the extra complexity.

When it will become relevant

When the lab gets real multi-user traffic with concurrent sessions
When RAG workloads with very long contexts (>10,000 tokens) are explored
When agentic workflows with extended multi-turn conversations are implemented
When a second GPU node is added and disaggregated prefill/decode can be experimented with
When the robotics project with Qwen3-VL uses repeated vision sessions — LMCache supports caching of vision tokens, reducing TTFT for recurring webcam images

Dielabs roadmap

Short term: keep the current vLLM setup as a baseline, document the prefix caching behavior across existing benchmarks
Medium term: experiment with LMCache as an additional layer on top of vLLM, starting from official documentation and the Dell Technologies reference as a methodological baseline
Medium-long term: study NVIDIA Dynamo as the framework for disaggregated serving in view of an eventual second GPU node
Market observation: monitor the availability of CMX/BlueField-4 (H2 2026) as a potential professional-positioning topic — the chain NIXL → Dynamo → CMX → BlueField-4 is an area where infrastructure engineering and inference engineering competencies converge

Useful references

LMCache: github.com/LMCache/LMCache docs.lmcache.ai Paper arXiv:2510.09665
NIXL: github.com/ai-dynamo/nixl
Dynamo: github.com/ai-dynamo/dynamo developer.nvidia.com/dynamo
Dell Technologies — Scaling Multi-Turn LLM Inference with KV Cache Storage Offload (January 2026)
VAST Forward 2026 — Breaking Through the GPU Memory Wall

dielabs.github.io · github.com/Dielabs

KV Cache Manual

Table of Contents

1. Why the KV cache is the real bottleneck

2. Taxonomy: session, turn, request, prefill, decode

Application side

Inference engine side

The envelope paradox

3. The three layers of parameters

What consumes tokens in the prompt

What does not consume tokens

4. Anatomy of the KV cache

Per-token consumption formula

Consumption examples at realistic context lengths

5. Request scope: the default and why it exists

6. PagedAttention and pool management in vLLM

What occupies VRAM at boot

The two phases of management

Concrete effects of PagedAttention

Worst-case concurrency in the logs

7. Behavior under load

Why requests wait

Preemption: the last resort

Intervention hierarchy

8. Prefix caching

Content-based, not session-aware

Opportunistic lifetime

Impact on benchmarking

9. Tuning parameters and operational trade-offs

gpu_memory_utilization

kv_cache_dtype

max_model_len

max_num_seqs

max_num_batched_tokens and chunked prefill

Upstream guardrails

10. Policy levers: TTL, quotas, eviction

11. Management profiles compared

Recommended benchmark

12. KV cache offload and storage hierarchy

Tier hierarchy (NVIDIA nomenclature)

Lifecycle with offload

13. Software ecosystem 2026

Inference Engines (L3)

KV Cache Management — LMCache (L4b)

Data Transport — NIXL (L4b)

Orchestration — NVIDIA Dynamo (L4a)

Infrastructure — NVIDIA CMX (L0–L1)

Relationship between components

14. The economic cost of long chats

15. Observability metrics

16. Scale: from single-node to distributed inference

17. Relevance for Dielabs and roadmap

When it will become relevant

Dielabs roadmap

Useful references