CPU-GPU Memory Topology for AI Inference
Scope: Taxonomy of CPU-GPU memory architectures relevant to LLM inference, with focus on modern unified/coherent patterns and their implications for data placement, KV cache tiering, and platform selection.
Status: Reference document. Vendor-verified facts where indicated; forward-looking sections are interpretive and marked as such.
1. The Problem
LLM inference is dominated by memory-bound workloads, not compute-bound ones:
- KV cache grows linearly with context length × batch size, eventually saturating HBM
- Models exceeding HBM capacity (70B+ at FP16, MoE such as DeepSeek-V3 671B) require tiering
- Decode is bandwidth-bound: every generated token reads attention from the entire accumulated KV cache
Consequence: CPU-GPU memory topology — and the nature of the interconnect between the two — becomes a primary architectural factor, not an implementation detail.
2. Conceptual Framing: Two Orthogonal Axes
The terms “shared”, “unified”, and “coherent” are often used interchangeably in marketing material. They are distinct and orthogonal properties:
| Axis | Question | Property type |
|---|---|---|
| Unified | Is there a single virtual address space? | Programming model |
| Coherent | Are views synchronized in hardware? | Hardware implementation |
Four theoretical combinations:
- Neither unified nor coherent → Discrete + classic PCIe (x86 + NVIDIA/AMD GPU via PCIe)
- Unified but not coherent → CUDA UVM over PCIe (software-simulated unification with on-demand page migration)
- Coherent but not unified → Classic multi-CPU NUMA (rare in CPU-GPU context)
- Both unified AND coherent in hardware → Apple Silicon, GH200/GB200, GB10 (Spark), MI300A
Combination (4) is the dominant pattern for modern AI compute.
2.1 Why the Distinction Matters
“Unified” describes what software sees. “Coherent” describes how it’s implemented underneath.
You can have software-simulated unification (UVM) on top of non-coherent hardware — it works, but with hidden copy costs. True hardware coherence eliminates both the copy and the page-fault handling overhead.
3. Operational 4-Quadrant Taxonomy
| Pattern | Address space | HW coherence | Physical memory | CPU↔GPU interconnect | Examples | Typical use case |
|---|---|---|---|---|---|---|
| Discrete + PCIe | Separate (UVM abstracts above) | No | Heterogeneous (HBM + DDR) | PCIe Gen4/5 (~64 GB/s unidirectional) | x86 + H100/B200 PCIe, A100 | Mainstream training & inference, DGX H100/B200 nodes |
| Homogeneous unified | Unified | Yes | Homogeneous (LPDDR only) | Internal SoC fabric (~400-800 GB/s) | Apple M-series, DGX Spark (GB10) | Dev workstation, edge inference, prototyping |
| Heterogeneous unified | Unified | Yes | Heterogeneous (HBM + LPDDR) | NVLink-C2C (~900 GB/s) | GH200, GB200, MI300A | Inference of large models (>HBM), MoE, memory-bound training |
| Legacy shared | Unified but statically partitioned | Partial | Homogeneous (subtracted from CPU) | Internal SoC bus | Older Intel/AMD iGPU | Consumer/office, not relevant for AI |
4. Deep-Dive on the Three AI-Relevant Patterns
4.1 Discrete + PCIe (Row 1)
Characteristics:
- CPU and GPU as separate domains with distinct address spaces
- Data transfers via DMA over PCIe, I/O-class semantics
- Software sees UVM (Unified Virtual Memory) as an abstraction, but page migration and copies happen underneath
Limits for inference:
- PCIe Gen5 x16: ~64 GB/s unidirectional → an order of magnitude below HBM (3-8 TB/s)
- I/O-class latency: TLP overhead, completion, ordering
- CPU-side KV cache offload is practical only at coarse granularity (entire prefixes, dormant sessions)
When it fits:
- Most of the current market (DGX H100/B200, cloud GPU instances)
- Workloads that fit comfortably in HBM
- Horizontal scaling via NVLink/InfiniBand across nodes (the PCIe bottleneck is intra-node)
4.2 Homogeneous Unified (Row 2) — Apple Silicon, DGX Spark
Characteristics:
- Single LPDDR pool on the SoC package
- CPU, GPU (and Apple’s Neural Engine) access the same physical addresses
- Dynamic allocation: no static pre-partitioning, the allocator distributes on demand
- The GPU can claim almost the entire pool if the workload requires it
Concrete examples:
- Apple M3 Ultra: up to 512 GB unified, ~800 GB/s bandwidth
- DGX Spark (GB10): 128 GB LPDDR5X unified, Blackwell GPU + 20-core ARM (Cortex-X925/A725), ~$3,000 at launch. Specs to be verified at actual launch — some details may have changed.
Advantages:
- Simple mental model: “one pool, take what you need”
- No data placement decisions: everything is at the same bandwidth
- GPU-addressable capacity = full system memory limit (no separate HBM cap)
Limits:
- Uniform bandwidth = no fast tier for hot weights
- LPDDR ~500-800 GB/s vs HBM ~5 TB/s → compute-bound workloads are penalized
- Spark/Mac do not replace H100/B200 in HBM — they are complementary platforms
macOS caveat: the system reserves a minimum quota; the GPU ceiling can be raised via sysctl iogpu.wired_limit_mb but a floor remains.
4.3 Heterogeneous Unified (Row 3) — GH200, GB200, MI300A
Characteristics:
- Two physically distinct memory tiers, logically unified
- HBM on the GPU package + LPDDR on the CPU package, joined by a coherent interconnect
- Same virtual address space (NVIDIA: ATS — Address Translation Services)
- Hardware coherence via NVLink-C2C (NVIDIA) or Infinity Fabric (AMD)
Concrete examples:
NVIDIA GH200:
| Memory | Technology | Capacity | Bandwidth |
|---|---|---|---|
| HBM3e | stacked DRAM on Hopper GPU | 96-144 GB | ~4.9 TB/s |
| LPDDR5X | DIMM-like on Grace CPU | 480 GB | ~500 GB/s |
| C2C interconnect | NVLink-C2C | — | 900 GB/s |
NVIDIA GB200: same pattern, Blackwell GPU with larger HBM3e, identical Grace LPDDR.
AMD MI300A: single-package APU with Zen4 CPU + CDNA3 GPU + unified coherent HBM3. Used in El Capitan (LLNL). Pattern similar to GH200, but single-die instead of CPU+GPU split across C2C.
Important: the enabling factor is Grace + NVLink-C2C, not Hopper/Blackwell per se. Hopper/Blackwell as discrete GPUs (in x86 PCIe nodes) do not have unified coherent memory with the host CPU.
Advantages:
- HBM stays where extreme bandwidth matters (active weights, attention compute)
- LPDDR extends capacity (warm KV cache, dormant MoE experts, models >HBM) without copy penalty
- Software sees a single pool → the allocator (or the informed developer) decides tiering
- Enables single-node serving of large models (DeepSeek-V3 671B MoE) without inter-node disaggregation
Trade-offs:
- HBM capacity is design-capped (not expandable)
- Performance depends on data placement: wrong data in LPDDR = decode bandwidth-bound at 500 GB/s instead of 5 TB/s
- Software stack complexity: today’s vLLM CPU-offload is designed for the PCIe model and doesn’t fully exploit coherence
5. The Key Insight: “Not an Access Problem, an Access-Cost Problem”
The GPU has always been able to reach CPU RAM, even over PCIe (DMA, UVM, mapped memory). The PCIe → NVLink-C2C → unified SoC fabric evolution did not change what the GPU can reach. It changed:
- How fast (bandwidth)
- At what latency (memory-class vs I/O-class)
- With what semantics (load/store vs DMA, copy vs direct access)
- At what software overhead (driver/copy engine vs direct access)
Operational framing:
The “memory wall” of AI inference is not a question of access feasibility. It’s a question of access cost.
A natural consequence: classical storage tiering (RAM → SSD → HDD → tape) and modern memory tiering (HBM → coherent LPDDR → remote CPU RAM → CXL → storage) are the same pattern applied at different scales. Everything is reachable; cost scales with distance.
5.1 The CPU is Not a Broker — It’s a Peer
A frequently misunderstood point: when the GPU accesses memory physically located “on the CPU side”, it does not ask the CPU for permission. The CPU as a processor is out of the data path.
What actually happens:
- On PCIe: the GPU’s copy engine (DMA) issues PCIe transactions directly to the CPU’s memory controller. Transactions traverse the PCIe root complex (passive hardware for this flow) and reach the memory controller. The CPU as a processor is not interrupted. This is identical to an NVMe doing DMA into RAM — the CPU does not “know” it’s happening.
- On NVLink-C2C: the GPU issues a load instruction (not a DMA). The address goes to the GPU’s MMU; if the physical address lands in Grace LPDDR, the request travels over C2C and the Grace memory controller responds. It’s the same instruction as a load on HBM — the GPU neither knows nor cares that the data is physically in LPDDR. Only latency changes.
- On Apple/Spark: memory is physically singular; GPU and CPU are peers on the same internal SoC bus.
What the CPU actually does in the access lifecycle:
- Initial setup: the CUDA driver prepares page tables, allocates buffers, configures the copy engine
- Page fault handling: if the page is not in RAM or not pinned, the CPU kernel intervenes (this is why pinned memory is used in AI data paths)
- Coherence protocol: on coherent systems, hardware protocols manage cache coherence — hardware, not CPU software
Everything else is GPU-driven data plane.
Practical implication: in inference, the CPU as a compute device matters only for pre/post-processing, tokenization, and the scheduler/serving framework (vLLM is Python on CPU). For data movement, the CPU is passive infrastructure. Adding more or faster CPU cores does not help inference throughput on an already-configured data path. What helps is interconnect bandwidth (PCIe Gen5 vs Gen4) or replacing the interconnect entirely (C2C, unified fabric).
Correct mental model: CPU and GPU are independent peers that share only memory, not master/slave. Anyone thinking “the GPU asks the CPU” reveals a CPU-centric mindset still rooted in the pre-AI-compute era.
5.2 Physical Path as Architectural Metaphor
The difference between the three patterns is not just bandwidth or absolute latency in numbers. It’s the physical and logical length of the path the GPU must travel to reach non-local memory.
x86 + PCIe GPU (Discrete):
GPU memory controller
↓
GPU PCIe root port
↓
PCIe link (PCB trace, ~cm distance)
↓
CPU PCIe root complex
↓
Mesh / Ring / Infinity Fabric inside the CPU
↓
CPU memory controller (IMC)
↓
DDR DIMM
Every hop with its own protocols (PCIe TLP, internal mesh, DDR command/address). The GPU is dialoguing with a memory subsystem designed to serve the CPU as primary client.
NVLink-C2C (GH200):
GPU memory controller
↓
NVLink-C2C interface (on package, ~mm distance)
↓
Grace CPU memory controller
↓
LPDDR5X
Physically shorter hops, no root complex, no CPU mesh to traverse. C2C is designed as a memory fabric, not an I/O bus — different semantics and latency.
Apple Silicon / Spark (Homogeneous unified):
GPU
↓
Internal SoC fabric (intra-die or intra-package)
↓
Shared memory controller
↓
LPDDR
Path effectively collapsed. Everything on the same silicon or package.
Typical latency for GPU access to CPU-side memory (approximate orders of magnitude — verify against vendor documentation for formal deliverables):
| Pattern | Typical latency | Notes |
|---|---|---|
| x86 + PCIe Gen5 | ~1-2 μs | Round-trip via PCIe + CPU mesh + IMC |
| NVLink-C2C | ~hundreds of ns | Memory-class, no I/O protocol |
| Unified SoC fabric | ~hundreds of ns or less | Quasi-local access |
| Local HBM (reference) | ~tens to ~100 ns | For comparison |
Conceptual transition reflected by these physical paths:
- x86 + PCIe → “GPU as accelerator attached to a CPU system” (Lego pattern: flexible, modular, not optimized for any specific workload)
- GH200 / GB200 → “CPU+GPU as co-designed system” (Grace dimensioned with GPU traffic in mind, C2C is not a bottleneck, address space genuinely peer-to-peer)
- Apple / Spark → “Unified system where there is no longer a CPU memory and a GPU memory” (single memory designed to serve both simultaneously)
Visual synthesis:
On x86+PCIe, the GPU is a guest of the CPU’s memory subsystem. On Grace-Hopper, they are roommates. On Apple Silicon, they live in the same room.
Three different paradigms of hardware co-design, directly reflected in the physical geometry of access paths.
Why this makes strategic sense for vendors: when the GPU is the heart of the system (AI inference/training), the Lego pattern shows its weaknesses. NVIDIA with Grace, Apple with M-series, and AMD with MI300A are all making the same bet: if the primary workload is AI, the system must be designed around the workload, not assembled from general-purpose pieces.
6. Implications for KV Cache Offload
KV cache offload works on all three relevant rows, but with profoundly different mechanisms and costs.
6.1 Offload on Discrete + PCIe (Row 1)
- Tools: vLLM
--cpu-offload-gb, LMCache CPU tier - Mechanism: explicit copy HBM ↔ system RAM via PCIe DMA
- Penalty: ~64 GB/s, I/O-class latency
- Practical granularity: coarse (entire prefixes, dormant sessions, cross-request prefix caching)
- Decode with offloaded KV requires back-and-forth transfer → significant TPOT impact
6.2 Offload on Homogeneous Unified (Row 2)
- “Offload” is almost a misnomer: KV cache lives in the unique pool from the start
- There’s no fast tier and slow tier — it’s all LPDDR at the same bandwidth
- You don’t offload: you just allocate
- Limit: uniform bandwidth ceiling across the entire pool
6.3 Offload on Heterogeneous Unified (Row 3)
- Mechanism: coherent load/store over NVLink-C2C, no copies
- ~900 GB/s vs ~64 GB/s of PCIe → ~14× more bandwidth
- Memory-class latency, not I/O-class
- Practical granularity: fine, layer-by-layer or beyond
- The GPU can perform attention computation with K and V physically resident in Grace LPDDR, simply slower
Typical strategy on GH200:
- Hot KV cache (active requests, last N tokens): HBM
- Warm KV cache (shared prefixes, paused sessions): Grace LPDDR
- Pattern similar to LMCache, but intra-node and copy-free
6.4 Mental Model Shift
| Row 1 (PCIe) | Row 3 (Coherent C2C) |
|---|---|
| “Move data at the right time” | “Leave it where it is, let it be read at lower speed” |
| Optimize when to transfer | Optimize where to allocate initially |
| Data movement | Data placement |
On Row 3 the operative phrase becomes: “How do I move data?” → “Where should data reside?”
7. Software Stack Status (interpretive)
Fact:
- vLLM has CPU offload (
--cpu-offload-gb) designed for the PCIe model with explicit copies - LMCache implements CPU-side KV cache tiering, originally for PCIe nodes
- NIXL (NVIDIA InferenceXfer Library) abstracts the transport layer for KV cache transfers
- NVIDIA Dynamo and llm-d explicitly aim to expose the coherent memory hierarchy as a native tier
Roadmap/trend (interpretive):
- On GH200, current stacks work but are sub-optimal — they don’t fully leverage HW coherence (they continue to perform logically explicit copies)
- Expected direction: software aware of the coherent tier, native KV cache allocation on Grace LPDDR without pseudo-offload
- Software maturity gap is still significant (as of April 2026, to be revisited)
8. Spark vs GH200: Direct Comparison (Row 2 vs Row 3)
| DGX Spark / Mac (Row 2) | GH200 (Row 3) | |
|---|---|---|
| Mental model | “One pool, take what you need” | “Two tiers, decide where to place what” |
| GPU capacity ceiling | Almost the entire system RAM | HBM + LPDDR (sum) |
| Bandwidth ceiling | LPDDR bandwidth for everything (~500-800 GB/s) | HBM (~5 TB/s) for hot tier |
| Data placement complexity | None | Significant |
| Sweet spot | Models that fit but don’t require HBM bandwidth | Large models with hot working set in HBM |
| Cost (order of magnitude) | $3K-$10K | $40K+ per single GH200 |
9. References for Further Reading
Vendor documentation (fact):
- NVIDIA Grace Hopper Superchip Architecture Whitepaper
- NVIDIA Blackwell Architecture Whitepaper
- AMD MI300A documentation (CDNA3 + Zen4 APU)
- Apple Silicon Unified Memory Architecture (developer docs)
Operational/practitioner:
- vLLM documentation: CPU offload, prefix caching
- LMCache documentation
- NIXL / Dynamo / llm-d project READMEs
Roadmap/trend (to verify):
- DGX Spark specifications at actual launch
- Software stack maturity on GH200/GB200 for native KV cache on LPDDR
All bandwidth, capacity, and latency figures are order-of-magnitude approximations. For formal deliverables, always verify against current vendor documentation or workload-specific benchmarks.