Disaggregated Inference


1. Foundation: Prefill and Decode Are Two Different Phases

Autoregressive LLM generation consists of two phases with opposite computational profiles.

1.1 Prefill

The model receives the entire prompt and processes it in a single parallel forward pass. For every token in the prompt it computes the Key (K) and Value (V) vectors of the attention mechanism and stores them in the KV cache. The output of this forward pass is the first generated token — there is no separate “generation start” step. The first token is the direct product of reading the prompt, exactly as a model in training predicts the next token given a sequence.

TTFT (Time To First Token) measures the duration of this phase.

Profile: compute-bound. High density of matrix operations, all tokens processed in parallel. Benefits from hardware with high compute capacity (FLOPS). An H100 in prefill saturates the tensor cores with massive matmuls.

1.2 Decode

From the second token onwards, the model generates one token at a time. At each step it computes Q, K and V only for the new token, appends K and V to the cache, and computes the attention of the new token against all previous K and V already in the cache. The forward pass processes a single token, not the entire sequence.

TPOT (Time Per Output Token) is the average time per token in this phase. ITL (Inter-Token Latency) is the punctual latency between consecutive tokens.

Profile: memory-bandwidth-bound. The bottleneck is reading from memory at every step: both the model weights (which must be read for every forward pass) and the KV cache (which grows with the sequence and becomes the dominant factor on long contexts). Compute per single token is minimal — time is dominated by data movement. An H100 in decode can stay below 20% compute utilization while HBM bandwidth is saturated. Benefits from hardware with high memory bandwidth.

1.3 Why the KV Cache Is Essential

Without KV cache, every decode step would be equivalent to a full prefill on the entire sequence accumulated up to that point: to generate token N, the model would have to re-read and recompute K and V for all preceding tokens (prompt + N-1 generated tokens). Each step costs O(n) in compute (attention over n tokens), and there are n steps → cumulative cost grows as O(n³) with sequence length.

The KV cache eliminates the recomputation of K and V: values for previous tokens are already in memory, each step processes only the new token. Attention remains O(n) per step (the new token still has to “look at” all previous ones in the cache), but per-step cost is dominated by reading from the cache, not by recomputation. Total cost drops to O(n²) — a change in order of magnitude, not just a lower constant. This transforms decode from compute-bound (many repeated prefills) to memory-bound (reading model weights and cache), making autoregressive inference practical.

The KV cache is not an optimization added a posteriori by inference servers. It is a direct consequence of the math of attention in transformers: K and V of previous tokens do not change when a new token is added, so recomputing them is pure compute waste. The innovation of inference engines (e.g. PagedAttention in vLLM) lies in the efficient management of the cache, not in its existence.


2. The Conflict: Two Profiles on a Single Piece of Hardware

In the traditional configuration, prefill and decode happen on the same GPU. This creates a structural compromise on two levels, and the second is architecturally more important than the first.

Level 1 — Instantaneous utilization mismatch. Prefill wants all available compute capacity; decode wants all memory bandwidth. The two phases compete for the same resources (compute units, memory, bandwidth, batch scheduler). With continuous batching (e.g. vLLM), requests in prefill and requests in decode coexist in the same batch. This improves GPU utilization but creates interference, not only at the compute level but especially at the scheduling and memory bandwidth level: requests in prefill and decode compete for access to weights and KV cache in the same execution cycle. Prefill requests (compute-intensive) slow down decode ones (latency-sensitive) and vice versa. Continuous batching mitigates but does not eliminate.

Level 2 — Independent scaling patterns. Prefill scales with the volume of incoming prompts (length × request rate), decode scales with the number of concurrent sessions and their generation length. A workload with long contexts is prefill-heavy; a conversational workload with long replies is decode-heavy. Sizing them together means oversizing one of the two to cover the peak of the other.

Disaggregated serving is born as a structural response: physically separate the two phases on dedicated hardware, scaled independently.


3. Three Architectures Compared

Given a budget of two GPUs, there are three architecturally distinct ways to use them to serve an LLM model.

3.1 Single Node

A single GPU handles the entire prefill + decode cycle. The second GPU stays unused. It is the minimum baseline, useful as a reference to measure throughput per hardware unit.

Flow: Client → vLLM Instance (prefill + decode) → Response

3.2 Replicated Serving

Two GPUs each run an independent and complete instance of the model. A stateless proxy distributes requests in round-robin. Each instance handles prefill and decode internally.

Flow: Client → Proxy (round-robin) → Instance A or B (prefill + decode) → Response

The proxy is simple: it keeps no state, does not know request content, just alternates between the two backends. Additional cost over single node is exclusively the proxy hop (~6% measured overhead).

3.3 Disaggregated Serving

The two GPUs specialize: one is dedicated to prefill, the other to decode. The KV cache generated by the prefill node is transferred to the decode node, which uses it for token generation.

Flow: Client → Proxy → Prefill Node → [KV cache transfer] → Decode Node → Response

Here the proxy becomes a stateful component that orchestrates a two-phase protocol:

3.4 Specific Advantages of Disaggregation

Independent hardware optimization. Each pool uses hardware suited to its profile. Prefill nodes can be dense GPUs (e.g. H100 SXM with high compute), decode nodes can be GPUs with high HBM bandwidth or even alternative hardware optimized for sequential reads.

Independent scaling. Prefill is bursty (a compute peak, then it ends), decode is prolonged. The two pools can be scaled differently based on load. Prefill nodes can even run on spot/preemptible hardware, since their work is atomic.

Interference elimination. No contention between prefill and decode requests in the same batch. Each pool handles a single workload type.

Differentiated pricing. Input tokens and output tokens have different production costs. Physically separating them makes pricing more transparent and aligned to real costs. This is already the norm in commercial APIs (e.g. input tokens cheaper than output tokens).


4. The Critical Role of the Interconnect

Disaggregation introduces an architectural cost not present in the other configurations: the KV cache transfer between nodes. End-to-end transfer must be kept in the millisecond range or below — beyond that, disaggregated becomes worse than colocated. This is the operational constraint that dominates the entire data path design.

4.1 KV Cache Size: Orders of Magnitude

Size is computable from model parameters with the formula:

KV_size = layers × kv_heads × head_dim × seq_len × bytes_per_element × 2  (K + V)

Two examples at the extremes of the operational spectrum:

Model Context Precision KV cache per request
Llama-3.1-8B (GQA, 8 KV heads, 32 layers, head_dim 128) 512 tokens FP16 ~32 MB
Llama-3-70B (GQA, 32 effective layers per attention, standard configuration) 32k tokens FP16 ~10 GB

With quantized KV (FP8, INT4) size drops by 2-4×. The order of magnitude is still such that the transfer must be fast, because it adds directly to TTFT — the KPI the end client perceives.

4.2 Bandwidth and Latency: Operational References

Technology Bandwidth Latency (32 MB) Impact
RDMA (RoCE / InfiniBand) 100-800 Gbps < 1 ms Negligible
Standard TCP/IP 1-25 Gbps 10-100 ms Significant

Additional references:

Conclusion: RDMA is an architectural prerequisite for medium/large models with long contexts. On small models (~7B) with short contexts, even 10-25 Gbps TCP can be sufficient. But without a high-speed, low-latency interconnect, the cost of KV cache transfer can exceed the gain from hardware specialization, making disaggregation a more complex architecture that performs worse than simple replication.

4.3 NIC vs DPU: What’s Actually Needed on the Data Path

The marketing narrative tends to present the DPU as a necessary condition for disaggregated inference. It is not. The correct framing is this:

For the pure KV-cache data path, an RDMA-capable NIC is enough. The transfer between GPUs of different nodes goes via GPUDirect RDMA:

GPU_A → NIC_A → fabric → NIC_B → GPU_B

without involving the host CPU or system memory. NVIDIA ConnectX-6/7/8 do this just fine. The DPU (BlueField) is not required to enable this pattern.

The DPU (BlueField) adds services on the data path, not the data path itself:

Useful in shared cloud environments and managed platforms; less relevant on a dedicated single-tenant training cluster.

4.4 The Fabric: Spectrum-X as a System

NVIDIA Spectrum-X is a fabric-level platform, not a single-DPU component. It combines Spectrum-4 switches, NICs (ConnectX-8 or BlueField-3/4) and software stack that together implement adaptive routing and congestion control optimized for RoCE.

The NIC/DPU participates in the system, but the value is in the switch ↔ NIC interaction, not in the DPU taken in isolation.

For disaggregated inference this matters because the KV-cache transfer traffic pattern is bursty and many-to-many — exactly the case where classical ECMP collapses and adaptive routing is required. On standard RoCE fabrics without adaptive routing, real throughput under synchronous bursts is typically much lower than nominal due to congestion on shared paths.


5. Storage Tier and KV Cache Pooling

Beyond direct prefill → decode transfer, an architectural pattern relevant from 2025-2026 is KV-cache pooling: instead of transferring the KV-cache directly between nodes, it is written to a tier of shared remote memory, where it can be reused cross-request (cache hit when the same prompt or prefix appears again — e.g. long system prompts, recurring RAG documents, multi-turn sessions).

5.1 NVMe-oF as the Dominant Implementation

Today the pattern is typically implemented via NVMe over Fabrics. It is the choice of Mooncake (Kimi/Moonshot) in production and is mature engineering.

Here the DPU is actually enabling, not just useful: it can be the NVMe-oF target in hardware, exposing to the decode node “virtual” NVMe devices that actually point to remote memory (shared pool or dedicated storage tier). Doing it in software on the host CPU would be prohibitive in terms of latency and CPU overhead.

It is the same pattern that Lambda Labs (and AWS Nitro, GCP, OCI) use for bare metal instance volumes, applied to the KV-cache.

5.2 CXL as Direction, Not State of the Art

Memory pooling via CXL 3.0 over fabric is conceptually cleaner than NVMe-oF (memory-semantic access instead of block-semantic, theoretically lower latency). As of 2026 it remains predominantly direction rather than widespread practice in production for inference.

It is worth mentioning as a roadmap in conversations with customers/prospects, not as the current state of the art. Anyone presenting it as deployment-ready today is selling vaporware.

5.3 Economic Implications of Pooling

KV-cache pooling changes the economics of inference in non-trivial ways:

Whoever designs disaggregated infrastructure should treat the pooling tier as a first-class component, not as an optional optimization.


6. Performance Analysis

6.1 The Fair Baseline Question

Most published benchmarks on disaggregation compare 2 disaggregated GPUs against 1 single-node GPU. This comparison is methodologically incorrect: you are doubling the hardware and declaring it goes faster.

The correct comparison is: given a fixed budget of N GPUs, which architecture (replication vs disaggregation) performs better? Only this comparison reveals whether disaggregation offers a real architectural advantage or simply a hardware advantage.

This distinction is often ignored in blog posts and papers, not necessarily in bad faith, but because the focus is to demonstrate that disaggregation works, not to quantify when it is worth it. For those who must make infrastructure decisions, it is a critical gap.

6.2 Single Request (Latency Test)

With a single request, there is no concurrency and therefore no possibility to overlap pipeline phases. The expected result is that disaggregation is the slowest configuration:

This test measures the worst case of disaggregation and it is essential to include it: it is the cost paid always, even when the system is idle.

6.3 Under Concurrent Load (Throughput Test)

Under concurrent load, the pipeline activates. While the decode node generates tokens for request N, the prefill node is already processing request N+1. The two GPUs work in parallel on different phases of different requests.

Reference data (Thomas/TokenLabs benchmark, 8 concurrent requests, Llama-3.1-8B):

Configuration Throughput (tok/s) Note
Single Node Baseline 1 active GPU
Replicated 26.5 tok/s 2 GPUs, round-robin
Disaggregated 101.9 tok/s 2 GPUs, prefill/decode split

The 26.5 tok/s replication figure is notable: it indicates that at 8 concurrent requests, round-robin between two nodes does not scale linearly. Each node handles 4 requests internally, but contention between prefill and decode in the same GPU caps throughput. Replication does not eliminate the prefill/decode conflict — it replicates it across more nodes. Each node continues to suffer the same internal contention, limiting real scalability.

Reference papers (DistServe, Splitwise, Mooncake) report throughput improvements at equivalent SLO in the order of 2-4× over colocated on large models, provided the transfer does not degrade TTFT.

6.4 The Missing Crossover Point

The benchmark presents only two points: 1 and 8 concurrent requests. This demonstrates that the crossover between replication and disaggregation exists, but does not identify where it falls. A curve with 1, 2, 4, 8, 16, 32 requests would have allowed determining the exact concurrency level at which disaggregation overtakes replication — critical information for capacity planning.


7. Speculative Decoding in Disaggregated Context

Speculative Decoding (SD) and disaggregated serving are complementary.

7.1 The SD Problem in Traditional Configuration

SD uses a small draft model to propose N candidate tokens, which the target model verifies in a single forward pass (mechanically identical to a mini-prefill incremental over the candidate tokens). This multiplies decode throughput.

However SD has costs:

For short outputs, the cost of double prefill can exceed the gain in decode. For long outputs, the decode gain widely compensates.

7.2 SD + Disaggregated: The Natural Complement

With disaggregated serving, SD initialization overhead (target prefill + draft priming, in whatever form the implementation requires) happens on prefill nodes optimized for compute. Additional cost is absorbed by hardware designed for that workload type.

Decode with SD happens on decode nodes, where the draft model proposes and the target verifies, producing token bursts. Decode nodes are optimized for bandwidth — exactly what is needed to read the KV caches of both models quickly.

Result: SD’s weak point (degraded TTFT) is not eliminated — initialization cost still exists — but is shifted to compute-optimized hardware where it is amortized far more efficiently. The strong point (accelerated decode) remains intact on decode nodes.


8. Implications for Hybrid CPU+GPU Architectures

Disaggregated serving opens interesting scenarios with heterogeneous hardware.

8.1 CPU as Decode Node — Limited Scenario

Server CPUs (e.g. dual Xeon with high DDR4/DDR5 capacity) have large RAM capacity but significantly lower bandwidth than HBM: DDR4 at ~68 GB/s per socket vs HBM3 at ~3 TB/s on H100. Memory latency is also higher, and the massive parallelism of GPUs is missing. CPU decode quickly becomes limited not only by bandwidth, but also by the lack of vector parallelism comparable to GPUs, making it difficult to sustain high throughput even with sufficient memory.

This makes the CPU a decode candidate only in specific scenarios:

A hybrid architecture in these scenarios could provide:

8.2 Dielabs Context

The Dielabs lab has:

The bottleneck in this configuration would be the KV cache transfer from the GPU node to the R730 over the local network. With a 1GbE network the transfer would be prohibitive; with 10GbE it becomes feasible for small models. For experimentation, a pragmatic approach would be to measure the cost of KV cache serialization/transfer against the gain from hardware specialization.

This scenario represents a “homelab” disaggregated serving on recycled enterprise hardware — an area with very little public documentation.


9. Open Challenges


10. From Manual Disaggregation to Production

The manual implementation (vLLM + Python proxy + NixlConnector) demonstrates the principles but lacks the components needed for a production environment:

Frameworks like NVIDIA AI Dynamo implement these components. However, understanding the underlying protocol is essential before adopting higher-level abstractions: if you don’t know the proxy → prefill → metadata → decode flow, you cannot diagnose when something doesn’t work or doesn’t perform.


11. Decision Model

11.1 When to Use Each Architecture

  Single Node Replicated Disaggregated
Proxy complexity None Stateless round-robin Stateful, 2-phase protocol
Network requirements N/A TCP sufficient RDMA for medium/large models; TCP 10-25 Gbps for small models
Latency (1 req) Minimal ~+6% Higher
Throughput (N req) Limited to 1 GPU ~2× (linear) High (pipeline)
Granular scaling No Only whole replicas Independent prefill/decode
Ideal use case Dev, test, low concurrency Medium load, simplicity High concurrency, asymmetric workloads

11.2 Conditions Favorable to Disaggregation

Condition Why
Long input sequences Heavy prefill → benefits from dedicated compute nodes
Long output Lots of time in decode → amortizes setup and transfer cost
High concurrency Prefill/decode interference in the batch becomes the dominant bottleneck
Fast network (≥100 Gbps RDMA) KV cache transfer does not cancel the advantage
Large models (>30B) Prefill cost is significant, separation has real impact
Asymmetric workloads Long prompts with short replies (or vice versa), where one phase dominates the other
Recurring shared prefixes System prompt, RAG → KV pooling (§5) amplifies the gain

11.3 Unfavorable Conditions

Condition Why
Small models (<7B) Prefill already fast, little to gain from separation
Short prompts + short outputs Too little time in either phase to amortize overhead
Low concurrency No interference to eliminate
Slow network (<10 Gbps) KV cache transfer becomes the new bottleneck
Single-node deployment Coordination overhead exceeds the benefit

11.4 Gray Area (To Validate with Benchmarks)

Condition Approach
Medium models (7B-30B) Measure if KV transfer time < time saved
Medium concurrency (10-50) Depends on prefill/decode ratio in the workload
10-25 Gbps network Feasible with KV compression, validate empirically
SD + disaggregated The double SD + transfer overhead requires high acceptance rate to be net positive

12. Scheduling, DPU-Side Logic and Roadmap

12.1 Current State: CPU-Side Router

Every disaggregated system has a router that decides:

  1. Which prefill node to send a request to (ideally one that already has part of the prefix in local cache).
  2. Which decode node to route the resulting session to (ideally the same node that received the just-generated KV cache, to avoid additional transfers).
  3. How to handle follow-up turns (same decode node as the session, if possible, to reuse the KV cache).

Today this router is almost always CPU-side: vLLM router, NVIDIA Dynamo frontend, custom gateways. It works, it’s the default, scalability is known.

12.2 Direction: DPU-Side Scheduling

The direction — pushed by NVIDIA via DOCA — is to move part of the dispatch logic onto the DPU to:

Watch the framing: it’s early adopter territory in 2026, not standard practice. Worth flagging in technical conversations as a plausible evolution, not as consolidated reality. Anyone presenting it as deployment-ready today is anticipating the roadmap.

12.3 Fact vs Direction — Quick Reference

To distinguish in technical conversations (presales, architecture review):

Fact today (2026):

Direction (roadmap, early adopter):


13. Three Takeaways

Three statements that distinguish an engineer’s discourse from a brochure’s:

  1. The DPU is an amplifier, not a prerequisite. GPUDirect RDMA runs on ConnectX. The DPU adds multi-tenant isolation, storage offload (NVMe-oF target in hardware), and — prospectively — scheduling. Presenting it as a necessary condition for disaggregated inference is marketing, not architecture.
  2. Distinguish fact from direction. Fact today: GPUDirect RDMA, NVMe-oF for KV pool, Spectrum-X, Dynamo. Direction: networked CXL, DPU-side scheduling, memory pooling over fabric.
  3. The KPI is TTFT, not GB/s. Bandwidth serves to meet end-to-end latency; latency serves to meet perceived SLO. Whoever sells bandwidth without contextualizing the SLO is selling the wrong thing. TTFT and sustained throughput in tokens/s under SLO are the numbers that matter to the end client.

Disaggregated inference is one of the few topics where network infrastructure, storage and GPU compute converge on the same problem. Whoever works in AI infrastructure — pre-sales, solutions architecture, inference engineering — must know how to tell the story knowing where the technical fact ends and the roadmap begins.


14. References


Dielabs — AI Inference Engineering Lab dielabs.github.io