The LLM Inference Stack Model

The Principle

An LLM inference stack reads like a pyramid: physical at the bottom, user at the top. Every layer does one thing and rests on the one below.

Reading rule: bottom-up, each layer enables the one above. Top-down, each layer depends on the one below.


The Layers

L6 — Client

GUI, CLI, SDK, agent framework. The user request enters here and the response returns here. Manages client-side session (history, context, token streaming) and renders the output. The point where the system meets the human or the application using it.

L5 — Gateway

Reverse proxy, ingress, authentication, rate limiting, per-model or per-tenant routing, TLS termination. The datacenter perimeter: protects, dispatches and enforces access policies before the request reaches the inference engine. Runs on CPU, never touches the GPU.

L4 — Serving Engine

vLLM, TensorRT-LLM, SGLang, TGI. The operational brain of the system. Receives requests from the gateway, queues them, batches them together (continuous batching), manages the KV cache lifecycle (allocation, eviction, prefix reuse) and decides moment by moment which sequences advance on the GPU. This is where throughput and latency are won or lost.

L3 — Model Execution

Frameworks (PyTorch, TensorRT-LLM engine) and CUDA kernels (FlashAttention, MLA, PagedAttention). Runs the forward pass: takes the input tokens, performs the model math on the GPU and produces the logits from which the next token is sampled. The layer where the actual computation happens.

L2.5 — Model Artifact

Model weights, numerical format (FP16, BF16, FP8, AWQ, GPTQ, GGUF), tokenizer, chat template, optional LoRA adapters. The raw material the runtime executes. Not hardware, not scheduling, not kernels: it’’s the what gets executed. Same GPU and same engine produce different results depending on the artifact loaded.

L2 — Container Runtime

Docker or containerd, nvidia-container-runtime, device mapping, driver and library injection. Prepares the isolated environment where the inference process runs and exposes the GPU to it. The bridge between the host operating system and the containerized application.

L1 — Driver & CUDA

NVIDIA kernel driver, CUDA driver and runtime API, NCCL for collective communication, NVML/DCGM for telemetry. The software layer that talks directly to the GPU: allocates VRAM, launches kernels, coordinates GPU-to-GPU communication and collects hardware health signals.

L0.5 — Interconnect & Fabric

NVLink/NVSwitch intra-node, InfiniBand HDR/NDR or RoCEv2 inter-node, switch topology. The fabric connecting GPUs to one another, within the same server and across servers. Almost invisible in single-node; in multi-node or disaggregated prefill/decode scenarios it often becomes the real bottleneck, because the KV cache has to travel between GPUs on different nodes.

L0 — Hardware

GPU (compute + VRAM), CPU, system RAM, PCIe, NVMe storage, power, cooling. The physical constraints that determine the ceiling of the entire stack. No optimization at higher layers can exceed the limits set here.


How to Read the Pyramid

Who consumes sits at L5 → L6 (gateway, client). Who decides sits at L4 (serving engine). Who computes sits at L3 (model execution). What makes it possible sits at L2.5 → L0 (artifact, runtime, driver, fabric, hardware).