The Inference Engineering Manual - Dielabs version

Chapter 0

Disclaimer: I tried to create a manual with well-differentiated chapters, but I prioritized — draft after draft, correction after correction — the accuracy of what you read over strict ordering. This means that Chapter Zero already includes fairly deep technical concepts that were originally planned for Chapter One, favoring comfortable reading over rigid concept sequencing at the expense of time I would have spent enforcing precise order. Consider this a beta manual, though one that is technically as free of errors and inaccuracies as possible.

When OpenAI releases GPT-4, or Meta publishes Llama 3, the world sees only the model ready for inference: billions of parameters trained on petabytes of data, capable of conversing, reasoning, coding. But between the moment those parameters are saved to disk and the moment a user receives a response, there is an entire invisible system that determines whether that service will cost $2 per million tokens or $20, whether it will respond in 200 milliseconds or 5 seconds, whether it will serve 100 simultaneous users or 10,000.

That system is called inference infrastructure.

A Note on Units of Measurement

In this manual we use GB (gigabytes) and MB (megabytes) following the common convention in AI/ML documentation and hardware datasheets. When we say “70GB for a 70B model in INT8,” we mean 70 billion bytes (70 × 10⁹).

Why this choice: Most documentation (NVIDIA, PyTorch, cloud providers) uses GB/MB ambiguously, sometimes referring to binary units (GiB = 1024³ bytes) and sometimes to decimal units (GB = 10⁹ bytes). To align with standard industry language, we use GB/MB with the understanding that they approximately represent memory capacity.

Calculation precision: 70 decimal GB (70 × 10⁹ bytes) correspond to ~65.2 binary GiB (70 × 10⁹ / 1024³). This ~7% difference is acceptable for capacity planning and architectural reasoning: the variation is smaller than the overhead caused by frameworks (PyTorch/vLLM), memory fragmentation, and dynamic OS allocations.

Ch 0: Birth of a Model, and Some Technical Deep Dives

An LLM is not born intelligent. At the beginning it is just an empty mathematical structure, a maze of billions of random numbers. It becomes capable of generating text through a rigorous training process that transforms statistics into language. In this chapter we will explore the lifecycle of a decoder-only Transformer model, the architecture that dominates the industry today.

The Base Structure: The Blueprint

First, you need to define the architecture: a neural network composed of dozens of identical layers stacked on top of each other.

Design choices

Before training begins, the model architecture is defined. The three choices with direct impact on runtime performance are:

Hidden dimension: the “width” of each layer. Determines the size of the weights and the computational cost for each token processed.
Number of layers: the “depth” of the model. Determines latency — data must traverse every layer in sequence, without skipping any.
Context length: the maximum sequence length the model can process. It is hardwired into the architecture during training and cannot be extended in deployment. It determines the maximum KV Cache size at runtime.

Impact on VRAM: These choices define the physical “weight” of the model. The wider the model, the more space it will occupy in GPU VRAM, both for the weights and for the KV Cache during use.

The Training Mechanism: The Error Signal

There is no training without comparison. Training is a mathematical optimization loop based on “checking” the answer:

Forward Pass (Prediction): The model receives a sequence and attempts to guess the next word.
The “Check” (Loss Calculation): The system compares the prediction with the correct answer. Without this comparison, the model would have no way of knowing whether it is improving or getting worse.
Backpropagation: The error is sent “backward” through the layers. The system calculates how much each weight contributed to the error.
Optimizer: The weights are slightly modified to reduce the error on the next attempt.

In summary: Training is a continuous dialogue between the model and a “judge” who possesses the truth.

Pre-training: Self-Supervised Learning

What happens: The model, initialized with random weights, is exposed to trillions of tokens of real text. For each sequence, the framework hides the last token and challenges the model to predict it. The weights are updated billions of times. The model learns the structure of language and of the world.

Who does the “check”?: The training framework operates in self-supervised mode. It takes a real sentence, hides part of it, and challenges the model to reconstruct it. Since the software possesses the original text, it acts as a mechanical judge: it has the “truth” in hand and triggers backpropagation every time the model gets it wrong.

Supervised Fine Tuning: The Teacher’s Control

What happens: The pre-trained model is exposed to curated datasets of ideal question/answer pairs. It learns the format of conversation and to follow instructions.

Who does the “check”?: Human experts who write datasets of “Question and Ideal Answer.” The system compares the model’s response with the perfect one written by the human and corrects the weights accordingly.

Reinforcement Learning from Human Feedback: The Critic’s Control

What happens: The model generates multiple response variants to the same question. Humans (or reward models) compare them and vote for the best one. The weights are updated to favor preferred behaviors.

Who does the “check”?: Humans (or reward models) who vote among different generated responses, indicating which is the safest and most relevant.

Checkpoint: The Model Gets “Frozen”

At the end of all training phases, the final parameters are saved in a file called a Checkpoint.

The Watershed: Here the dialogue ends. The “check” disappears because it is no longer needed for learning. Backpropagation is turned off and the weights become static.
The Beginning of Inference: The model is now a frozen object that will serve to perform prediction calculations. Our job as Inference Engineers is to take this object and optimize it (e.g., via quantization) to run it at maximum speed.

Journey Summary

Architecture → Empty mathematical structure.
Pre-training → Text is the teacher (Self-supervised).
Fine-tuning → The human is the teacher (Supervised).
Reinforcement Learning → Preference is the teacher (Feedback).
Checkpoint → Frozen model (End of the “check” and of learning).
Inference → Static application (The subject of this manual).

Conclusion Training is a journey of constant weight correction. Inference is the application of those weights to serve users. Everything we will see in the following chapters is about how to make this “frozen” model speak as quickly and efficiently as possible.

Structural note — Training vs Inference During training, each sequence is built on real text (ground truth). The model never sees its own outputs as context: it always receives the “truth” truncated. During inference, the context accumulates including tokens generated by the model itself. Each token produced becomes input for the next token — a regime called autoregressive generation. This difference explains why a model can degrade on long sequences: it is operating in a domain (self-generated context) structurally different from the one it was trained on.

0.1 The Frozen Model: Dimensions and Overhead

At the end of training, the model changes state: from modifiable to immutable.

During training, the model’s weights (which are the majority of the model’s parameters — for simplicity in this manual, weights = parameters) are continuously updated. This process requires additional memory beyond the weights themselves: copies of parameters, modification history, and data structures for optimization. A model with 70 billion parameters in training occupies about 420GB of GPU memory. (for now we use the base rule parameters × 6)

In inference, all this complexity vanishes. The weights are fixed, loaded once into memory as read-only data. They are never modified during use. The size depends only on how many bytes you use to represent each parameter (which is essentially a decimal number):

70B parameters × 2 bytes (FP16) = 140GB + ~20GB overhead = 160GB (Standard Configuration)
70B parameters × 1 byte  (INT8) = 70GB  + ~10GB overhead = 80GB  (Balanced Configuration)
70B parameters × 0.5 byte (INT4) = 35GB  + ~5GB overhead  = 40GB  (Optimized Configuration)

VRAM does not host only the model weights — we will see in subsequent chapters what else occupies it during execution.

0.2 The Model Container: Where Weights Live

If the model is the “immutable brain,” the file format is the box you store it in. The file extension is not just a label — it determines the loading strategy and the efficiency with which bytes travel from disk to the GPU’s circuits.

Format Choice: Loading Strategies

Not all files are equal. The choice of container defines which Inference Engine (Layer 3) you can use and how memory will be managed during startup.

“Bare” Weights (.bin / .pt): The old PyTorch standard based on Pickle technology.
- The problem: They require Python to be “unpacked.” The CPU must read the data, reconstruct it in RAM, and then send it to the GPU. This process is slow, requires a lot of system RAM, and presents security risks (malicious code execution).
Protected Weights (.safetensors): The current standard for Enterprise performance and security.
- Zero-Copy Technology: .safetensors offer the ideal implementation of Zero-Copy technology. Through Memory Mapping (mmap), the Inference Engine can map weights directly from disk to GPU VRAM, skipping intermediate steps in CPU RAM.
- Practical advantage: Near-instant startup (fast Cold Start) and the ability to load giant models even on machines with less system RAM.
Compact Weights (.gguf): The universal format for flexible inference (native to llama.cpp).
- Offloading: Its main feature is the ability to split the model: if you have a 35GB model and a 24GB GPU, GGUF lets you load 20GB onto the GPU and “park” the remaining 15GB in system RAM. The model will still work, albeit with reduced performance.
- All in GPU: You can use GGUF even if you have enough VRAM. In that case, you choose it for the convenience of having weights, metadata, and tokenizer in a single ready-to-go file.

Technical Box: Bypassing the CPU

In modern AI datacenters, the unifying principle behind every data transfer optimization is one: eliminate the CPU from the data path.

The CPU and system RAM are general-purpose components — fast for logic, slow as a transit channel for large data volumes. Every time data passes through them, you pay latency and bandwidth that you never recover.

Technologies change, the problem is always the same:

Zero-copy / mmap → storage directly into VRAM, CPU coordinates without touching the data
GPUDirect RDMA → network directly into VRAM, bypassing system RAM
NVLink → GPU-to-GPU direct, without going through PCIe and CPU

The CPU remains essential as an orchestrator — it decides what to move and when — but it no longer physically transports the data. The actual transfer is always handled by DMA (Direct Memory Access) with compatible hardware that moves bytes autonomously.

Practical implication: when you evaluate a hardware architecture for inference, the question is not just “how much VRAM do you have?” but “how many bottlenecks exist in the path between data and GPU?”

The Decision Fork: Algorithm vs Format

As an Inference Engineer, you will often face a full-precision model (FP16/BF16) of 140GB and need to decide how to prepare it for production. The process has two steps:

Step 1 — Algorithm (“how” to compress). Choose the mathematical technique: AWQ, GPTQ, K-Quants, etc. This choice determines the trade-off between compression and residual model quality.

Step 2 — Format (“where” to run it). Choose the serialization container based on the inference engine you will use.

If the goal is…	You will choose Algorithm…	In Format…	For Engine (Layer 3)…
Scalable Production (Cloud)	AWQ / GPTQ	`.safetensors`	vLLM / TensorRT-LLM
Portability / Low VRAM	GGUF (K-Quants)	`.gguf`	llama.cpp / Ollama
Extreme Speed (Single GPU)	EXL2	`Folder/File`	ExLlamaV2

Technical Box: Latency vs Throughput

The choice of format and engine drastically influences server behavior:

GGUF Scenario (Optimized for Latency): Ideal for a single user who wants words to appear instantly on their screen. The perfect choice for local assistants or workstations.
Safetensors + vLLM Scenario (Optimized for Throughput): Ideal for a company that needs to serve 100 employees simultaneously. Thanks to techniques like Continuous Batching, this stack sacrifices a few milliseconds of initial response time to maximize the total number of words generated per second for all users.

In summary: If your goal is the Data Center, .safetensors with Zero-Copy are your standard. If your goal is hardware flexibility, the .gguf format is your best ally.

Case Study: Ollama: “On-Demand” Philosophy (Savings)

Ollama is designed to run on your personal computer, where the GPU also serves your monitor, browser, or video games.

Behavior: By default, Ollama loads the model only when needed and “evicts” it from VRAM after 5 minutes of inactivity.
Consequence: Every time you come back to query it after a pause, you suffer the “Cold Start.” Even with Zero-Copy, you still have to wait for the GB of weights to travel from disk to VRAM.
Advantage: It doesn’t “hijack” your graphics card all day.

vLLM: “Always-On” Philosophy (Performance)

vLLM is designed for servers that do nothing but AI inference.

Behavior: At startup, vLLM analyzes your GPU and says: “This card has 80GB? Good, I’m taking 90% right now and never letting go.”
“Pre-occupation”: vLLM pre-loads not only the weights (immutable), but also pre-allocates all possible space for the KV Cache (the variable overhead).
Consequence: The first response is instantaneous because everything is already “hot” and ready in the GPU circuits.
Drawback: You cannot use that GPU for anything else; vLLM becomes its absolute master.

Although both use Zero-Copy to minimize transfer times from disk, the operational management is opposite:

Ollama (Lazy Loading): Loads weights only on the first request. Optimized for coexistence with other programs on the same machine. The initial delay is the price you pay for flexibility.

vLLM (Static Allocation): Loads everything at startup and pre-emptively occupies nearly all available VRAM to maximize Throughput. Optimized for total responsiveness and multi-user serving.

Pro Tip: In production, Lazy Loading is almost never used. The preference is to keep the model always “hot” in memory to guarantee consistent and low response times (TTFT - Time To First Token).

0.3 Model Efficiency: Beyond the Number of Parameters

We can say that the number of parameters represents the potential “brain capacity,” but actual intelligence depends on how that capacity is used.

Here are the three reasons why a model with fewer parameters can beat a larger one:

Data Quality

A model with 7B parameters trained on 2 trillion high-quality tokens (books, clean code, logical reasoning) will almost always be more “intelligent” than a 70B model trained on garbage or repetitive data.

Example: Is a small brain that studied the entire encyclopedia better, or a giant brain that only read YouTube comments?

Information Density (Chinchilla Scaling Laws)

There is an optimal equilibrium point. DeepMind researchers (in the Chinchilla study) discovered that many models were “under-trained.” Today we know that it is often more efficient to train a small model for much longer (exposing it to more data) rather than making an enormous one and training it briefly. A small “dense” model is a dream for an Inference Engineer because it is fast and intelligent.

Architecture: Dense vs MoE (Mixture of Experts) — and Let’s Start Talking About Prefill/Decode

Here the distinction becomes technical.

Dense Models: Every time you ask a question, all 70B parameters “activate.” It’s as if to answer “what is 2+2” you used every single neuron in your brain.
MoE Models (e.g., Mixtral): The model has many parameters (e.g., 47B), but for each word it activates only a small portion (e.g., 12B).

Result: You get the intelligence of a large model with the inference speed of a small one.

Who Chooses Which Experts to Activate? (The Router)

It is not a human and it is not the inference software (vLLM or Ollama) that chooses. The decision is internal to the model and was learned during training.

In an MoE model, each layer (or group of layers) has a component called a Router (or Gating Network).

When a token (a word) arrives at the router, it analyzes it as a mathematical vector.
The Router decides instantly: “For this physics concept, expert #3 and expert #7 are the most competent.”
The signal is sent only to those experts.

Fun fact: The Router is itself a small neural network. During training, it learned to route traffic based on result effectiveness. If expert #3 responded well on physics, the router “reinforced” that connection.

Why Is It “Fast Like a Small One but Intelligent Like a Large One”?

Imagine having an MoE model with 47B total parameters that activates only 12B per token:

Memory Side (VRAM): You must hold all 47B parameters. Your GPU must be large enough to host the entire model (about 94GB in FP16). Here the MoE behaves like a Large model.
Compute Side: When you press “Enter,” the GPU performs calculations only on the 12B activated parameters. Since generation speed depends on how many calculations you do per second, the model responds with the speed of a 12B model. Here the MoE behaves like a Small model.

Technical Box: Prefill vs Decode in MoE

The advantage of 12B active parameters is not distributed equally across the two generation phases.

Decode (generating tokens one by one): this is where MoE expresses maximum advantage. Decode is memory-bandwidth-bound — for each new token generated, the GPU must read from VRAM both the model weights (for the forward pass) and the entire accumulated KV cache (for attention computation over all previous context). MoE reduces the first of these two costs: by activating only 2 experts out of 8, the GPU reads ~12B of weights instead of 47B. The cost of reading the KV cache remains unchanged. All 47B parameters reside in VRAM, but the router activates only 2 experts out of 8 per token: the GPU effectively reads ~12B instead of 47B. Fewer bytes transferred on the memory bus → directly faster generation. The benefit is full and immediate.

Prefill (processing the initial prompt): the advantage exists but is less pronounced. Prefill is compute-bound — the GPU processes all prompt tokens in a single parallel batch, which raises arithmetic intensity (FLOP/byte ratio) and saturates compute units before bandwidth. MoE reduces effective FLOPs (activating ~12B instead of 47B), so prefill is still faster compared to an equivalent dense model — but the gain is not linear as in decode, where the bottleneck (bandwidth) is attacked directly.

In summary: MoE is structurally optimized for the most critical phase — decode — which is exactly the one that determines user-perceived latency during generation.

Dense vs MoE: The Comparison

To make the concept crystal clear, here is a comparative table:

Numbers are rounded for clarity. Mixtral 8x7B has an architecture with shared layers resulting in ~46.7B effective parameters, with ~13B active per token.

Practical Implication: An MoE model is an efficiency “trick.” It lets you have the precision of a vast model (because it has seen lots of data and is specialized in many fields) while maintaining the responsiveness of a much leaner model. However, you don’t save on VRAM: you still need space to host all experts, even those that will remain “silent” most of the time. If you have plenty of VRAM but want lightning-fast responses for your users, an MoE is the perfect choice. If instead your VRAM is limited, a smaller but well-trained Dense model might be easier to manage.

The Inference Engineer’s Point of View

For you, a model with too many parameters relative to its intelligence is an efficiency problem. If an 8B model (like Llama 3) achieves results similar to an old 30B model, your job becomes immensely easier:

Less VRAM occupied.
Reduced latency (faster responses).
Hardware costs plummet.

In summary: Parameters are like an athlete’s height: being tall helps in basketball, but being 2 meters tall is not enough to be Michael Jordan. You need technique (the architecture) and training (the data).

0.4 Architectures: Decoder, Encoder, Hybrids

The model architecture determines how you must manage memory. There are three main types, each with different characteristics for those who must put them into production.

Decoder-only models (GPT, Llama, Mistral) generate text one word at a time, where each new word depends on all previous ones. To avoid recalculating the entire context at each step, these models use a structure called KV cache that stores intermediate results. This cache grows linearly: the longer the context, the more memory it occupies. If you serve 10 simultaneous users with conversations of 4000 tokens each, the KV cache occupies far more space than the model parameters themselves. These models dominate production because generative AI (chat, code, content creation) is the economically relevant workload.

Encoder-only models (BERT) process the input all at once in a single pass. They do not generate text word-by-word but produce a fixed-size result: embeddings or classifications. They do not need growing memory structures because the input is fixed and the output is fixed. This makes them easy to parallelize: you can process 100 sentences simultaneously without mutual dependencies. They are used for retrieval, classification, or embedding generation.

Encoder-decoder models (T5, BART) combine both approaches. One part of the model (encoder) processes the input all at once. The other part (decoder) generates output word-by-word. You must manage two separate caches: one for the encoder (which remains fixed after the first calculation) and one for the decoder (which grows with each generated token). These models are more complex to serve and are used primarily for translation or summarization when input is very long.

For the inference engineer, decoder-only represents over 90% of production work. The techniques discussed throughout the rest of this manual focus on this architecture.

0.5 The “Journey” Inside the Model

For an inference engineer, an AI model is not a magical entity but a sequential data flow. Understanding how data moves through the model structure is the key to optimizing performance.

System Components — with Prefill and Decode

To understand the “journey,” we must define the four main actors:

Parameters (the cargo): the numerical values (the weights) that the model learned during training. They represent knowledge. The more parameters there are, the heavier the “baggage” the GPU must process.

Layers (the stations): the model is divided into sequential levels. Each layer’s job is to refine information. Data cannot skip levels: it must enter Layer 1, exit, enter Layer 2, and so on.

The KV Cache (crystallized context): the numerical representations of every already-processed token, saved layer by layer and reused by attention to generate each new token without recalculating the entire sequence.

The Forward Pass (the journey): the complete execution through all model layers, from input to the prediction of a single next token. The mechanism is always the same — attention, FFN (or MoE), layer after layer. What changes is the initial condition:

Prefill: when the KV cache is empty. The model receives the entire prompt (N tokens), processes them in parallel, populates the KV cache layer by layer, and generates the first output token. High arithmetic intensity operation → compute-bound.
Decode: when the KV cache is already populated. The model receives a single token (the last generated one), compares it with all the crystallized context in the cache, adds its own representation, and predicts the next token. The cycle repeats until the end of the response. Low arithmetic intensity but high bandwidth consumption operation → memory-bandwidth-bound.

Optimizing the Journey: Shortening Response Times

The Inference Engineer’s goal is to reduce decode latency — the time needed to generate each individual token. This is the latency perceived by the user: the speed at which words appear on screen. If inference is a journey inside the model’s layers, optimizing means shortening the route, lightening the load, or traveling in groups.

Strategies divide into two categories: Hardware-Efficient (how we read data) and Algorithmic (how we manage computation).

1. Hardware-Efficient Strategies (Optimizing movement)

Reducing weight (Quantization): The most direct strategy. Transforming parameters from FP16 (2 bytes) to INT4 (0.5 bytes) reduces by 75% the amount of data the GPU must read from memory at each pass. Less data to move = less waiting time for the processor.
Parallelism (Model Parallelism): If a model is too large for a single GPU, you split it across multiple chips. Tensor Parallelism fragments the calculations of each layer across multiple GPUs in parallel, while Pipeline Parallelism distributes layers sequentially (GPU 1 processes layers 1-40, GPU 2 processes layers 41-80).
Kernel Optimization (Flash Attention): You don’t change what you calculate, but how you physically write it in GPU memory. Flash Attention reduces the “round trip” data journeys between slow memory (HBM) and fast memory (SRAM) on the chip, drastically accelerating attention computation, especially with very long texts.

2. Algorithmic Strategies (Optimizing logic)

Paged Attention (GPU virtual memory): without this optimization, each request pre-allocates a contiguous block of VRAM for the maximum possible context length. If maximum context is 8K tokens but the response uses 500, the rest is wasted and non-reusable VRAM. With many concurrent requests, fragmentation becomes the real bottleneck. Paged Attention — introduced by vLLM — applies the same principle as paged virtual memory in operating systems: it allocates the KV cache in small, non-contiguous blocks, assigned on-demand as the sequence grows. Result: nearly zero waste, more requests serviceable in parallel with the same VRAM.
Speculative Decoding (the fortune teller): a tiny, ultra-fast “sketch” model (e.g., 1B) is placed alongside the large model (e.g., 70B). The small one tries to guess the next 5-6 tokens; the large one verifies them all together in a single forward pass. If the small one guesses correctly — very likely for articles, conjunctions, predictable patterns — we have generated multiple tokens in the time of one.
Continuous Batching (the dynamic elevator): in traditional systems, the server processed a fixed group of requests and waited for all to finish before accepting new ones. Those who finished early waited for the others — the GPU ran partially idle. With Continuous Batching, each request is independent: as soon as one finishes, a new one immediately enters without waiting for the others. The GPU always works at full capacity.

Three optimizations, three different bottlenecks: Paged Attention attacks VRAM waste, Speculative Decoding attacks autoregressive decode latency, Continuous Batching attacks system throughput.

Hardware Implications of the Journey

As an engineer, you must always relate software theory to hardware limits. Here is how to balance the components:

RAM and VRAM (Load Capacity)

Relationship: Parameters determine how much VRAM you need.
Advice: Never saturate VRAM at 100% with parameters alone. You must leave room for the KV Cache and for the Forward Pass’s temporary data. If a 70B model occupies 35GB in INT4, a 40GB GPU is the bare minimum, but an 80GB one is ideal for managing long contexts and many users — we will elaborate on this concept.

Memory Bandwidth (Cruising Speed)

Relationship: The Forward Pass in decode requires continuously reading parameters from VRAM. Reading speed depends on Memory Bandwidth — the width of the channel connecting VRAM to the GPU’s compute cores, measured in GB/s.
Advice: Professional GPUs with HBM memory (NVIDIA A100/H100) have superior bandwidth compared to consumer GPUs with GDDR6.

Latency vs Throughput

Latency — the time the user waits. It consists of two phases:

Prefill: the prompt (input) is processed in a single parallel forward pass, and by default the first new token is also generated. Duration scales with prompt length.
Decode: the output is generated via one sequential forward pass for each token that must be produced. Each forward pass traverses all model layers in series — more layers, combined with the KV cache, mean more time to produce a single token.

The final perceived latency is the sum of prefill and all decode steps. Model depth (number of layers) directly impacts per-token time, which accumulates linearly with response length.

Throughput — output tokens generated per second summing all concurrent users, measured in the decode phase (decode produces the output). The key mechanism is batching: model weights are read from VRAM only once per forward pass and applied to all sequences in the batch simultaneously. Batch size 1 reads all weights and produces 1 token; batch size 32 reads the same weights once and produces 32 tokens. Same bandwidth cost, multiplied output. Continuous batching applies to decode: it groups the decode steps of multiple requests into the same GPU batch, inserting and removing requests dynamically.

Prefill is already parallelizable on its own, so it benefits less.

The constraint on batch size is VRAM occupancy: each concurrent sequence maintains its own KV cache in memory. When VRAM saturates with KV cache, the batch cannot grow — and this is one of the ways throughput hits its ceiling. Prefill does not produce output but competes for the same GPU resources. A long prefill can delay the decode steps of sequences already in batch — techniques like chunked prefill and disaggregated serving exist to mitigate this interference.

The trade-off — with the same hardware, latency and throughput compete for the same resources. Increasing batch size improves throughput (more total tokens/s) but can increase latency per individual user (more sequences competing for compute cores). Optimization consists of finding the batch size that maximizes throughput without violating latency constraints — the operational equilibrium point.

0.6 Quantization, Recap

Quantization is the process of reducing the bits used to represent each model parameter. Instead of using the 16-bit standard (FP16/BF16), you go down to 8 bits (INT8), 4 bits (INT4), or even less.

Why It Works: Error Tolerance

Large Language Models are incredibly resilient mathematical structures. Reducing numerical precision introduces statistical “noise,” but the model is able to tolerate these approximations without the text quality degrading drastically. A model in INT4 produces slightly different responses than FP16, but for the end user the difference is often imperceptible.

Why It Is Vital: The Bandwidth Wall

The advantage of quantization is not just disk space; it is speed of movement. In inference, the bottleneck is almost never the GPU’s compute power, but the speed at which data travels from memory (VRAM) to the compute cores.

Engineering Logic: If the GPU must read 140GB (FP16) versus 35GB (INT4), the second operation will theoretically be 4 times faster. Fewer bits = less data to move = faster generation.

This directly impacts three operational pillars:

Capacity (How many users?): Determines whether you can load the model at all. If a 70B in FP16 occupies 140GB and you have an 80GB GPU, deployment is impossible. In INT4 the same model occupies 35GB, leaving you a full 45GB free to manage the KV Cache of hundreds of concurrent users.
Speed (How fast?): The GPU spends most of its time “waiting” for parameters. Reducing data weight means reducing idle time, increasing Tokens per Second (TPS).
Cost (How much do I save?): Cost per token decreases proportionally with throughput. If you double the generation speed on the same hardware, you have halved operational costs.

The Trade-off: Quality vs Efficiency

The choice of precision is a business decision, not just a technical one:

Precision	Quality Impact	Ideal Use
FP16 / BF16	None (Original)	Critical sectors: Medicine, Legal, Scientific Research.
INT8	Minimal (1-3% drop)	Enterprise standard for balancing costs and fidelity.
INT4	Visible (5-10% drop)	Consumer chatbots, Local assistants, Large-scale deployment.

Technical Box: Memory-Bound vs Compute-Bound

Note for the Inference Engineer: LLM inference is almost always Memory-Bound, especially during decode. The processor is much faster than memory — the real limit is the bandwidth of the channel between VRAM and compute cores. This is why quantization is the most effective optimization: it does not accelerate the physical channel, but reduces the volume of data to transport. Less data = less waiting = more tokens per second.

0.7 From Training to Serving, Recap

The path of a model from creation to production passes through three stages with different responsibilities.

Training is handled by the research team. It requires high numerical precision to avoid problems during optimization. The modern standard uses BF16 (16 bits, like FP16 but with different numerical characteristics). A 70B model in training occupies 420GB of memory: 140GB for weights, 280GB for optimization structures. This phase produces a final file called a checkpoint.

Release checkpoint is the public format distributed by labs. Weights are saved in standard formats like safetensors or GGUF, typically maintaining training precision (FP16 or BF16). This file is immutable and appears on public repositories like HuggingFace. Download size is determined by the training team: a 70B in FP16 is a 140GB file.

Inference optimization is where you take control. You download the checkpoint and convert it to the precision you need: INT8, INT4, or mixed formats. This conversion is called post-training quantization. It requires running the model on some representative examples (a few thousand) to determine how to compress parameters without losing too much quality. The result is a new file optimized for your deployment.

The decision on which precision to use depends on your hardware constraints and business requirements:

Only have 40GB of available VRAM? You must use INT4.
Can afford 80GB but want to maximize concurrent users? INT8 is a good compromise.
Quality is critical and you have 160GB available? Use FP16.
Want to balance everything? Use mixed formats: critical parameters in FP16, the rest in INT4.

The boundary is sharp: the training team produces frozen weights in high precision. You choose how to represent them in memory based on deployment constraints. You cannot modify the weights (they are frozen), but you can choose the quality/capacity/speed trade-off that works for your use case.

0.8 Why Internal Mechanics Matter

Up to now we have treated the model as a black box: a file of parameters that occupies memory. For an Inference Engineer, this abstraction is insufficient. It is not enough to know that a 70B occupies 140GB; you must understand how it uses those 140GB during execution.

The difference between an efficient and an inefficient deployment is not (only?) in choosing the most powerful GPU, but in understanding which resource saturates the workload. Two deployments of the same model on the same GPU can have throughput differing by 5-10×, not due to software configuration but due to mismatch between workload characteristics and hardware capabilities.

Concrete example: A 70B in FP16 on H100 (80GB VRAM) cannot load the weights (140GB > 80GB). Obvious solution: quantize to INT8 (70GB). The model now fits in VRAM. But after loading it, the GPU operates at 5% (example) utilization during generation. Why? It’s not a compute power problem — the H100 has compute to spare. It’s a bandwidth problem: the cores are ready to calculate but constantly waiting for parameters to arrive from VRAM. The GPU is not underpowered — it’s waiting. This is the Memory-Bound workload in action. Buying an even more powerful GPU would solve nothing — the cores would still be waiting. The only effective lever is reducing the volume of data to transport: quantization, optimized formats, architectures that minimize memory reads.

To understand why this happens at a physical level, you must look inside the forward pass — that is what Part 1 does.

Part 1 explains the model’s internal mechanics because it determines: - Where resources go: parameters vs KV cache - Which resource saturates the workload (memory vs compute) - Which optimizations work for your specific case. After Part 1 you will understand why a system can have 40% free VRAM yet still be slow, or have GPU at 5% but still saturate throughput. These paradoxes are resolved by understanding the workload at the physical level.

Chapter 1 Ch 1: How a Model Behaves

In 0.5 we saw the theoretical “journey” of data. Now we descend into the silicon to quantify the bottlenecks. If in Chapter 0 we took a photo with the engine off, here we analyze the telemetry while the car is at 300 km/h.

1.1 The Two Physics: Prefill vs Decode (Quick Recap)

As we have seen, inference is a two-phase game with opposing workloads.

Prefill (Initialization): The model reads the prompt (e.g., 1000 tokens) in one shot. The GPU operates in Compute-bound mode: the cores are saturated (60-70% utilization) because they have massive parallel data to work on.

Decode (Generation): The model generates one token at a time. Here we enter Memory-bound mode. Speed is dictated exclusively by how fast VRAM “fires” data to the cores.

1.2 The Memory Wall: Why the GPU Sits at 5%

Why does an H100 seem to “sleep” during chat? Let’s start from the simplest and most concrete case possible: the first generated token of a new conversation, with a single request in progress (batch=1). At this point the KV cache is still empty and the only data the memory channel must transport is the model weights.

Take a 70B model in INT8 (70GB).

1. Read Time An H100 has a memory bandwidth of 3.35 TB/s (3350 GB/s). Reading 70GB of weights requires:

70 GB ÷ 3350 GB/s = 0.0209 s ≈ 21 ms per token  [batch=1, empty KV cache]

(Theoretical lower bound — minimum case. With growing KV cache or more simultaneous requests, the load increases: we see this in the next section.)

2. Compute Time The H100 cores execute the math on that single token in less than 1ms.

3. The Result

1 ms work ÷ 22 ms total ≈ 4.5% utilization  [batch=1]

4.5% is the most favorable case: first token, a single request in progress, empty cache. This is the ceiling of utilization in single-request decode mode — in reality it goes lower.

Hardware Comparison: Bandwidth Determines Performance

Hardware	Bandwidth	Token/s (70B INT8, batch=1)
NVIDIA H100	3350 GB/s	45–50 (Reference)
NVIDIA A100	2000 GB/s	27–30 (~60% of H100)
RTX 4090	1008 GB/s	13–15 (Consumer limit)

Indicative values at batch=1 (single request) — with more simultaneous requests the picture changes significantly: we see this in the next section.

Operational Insight: If your load is 90% chat (heavy decode), buying the GPU with more TFLOPS is a mistake. You must look at GB/s. A GPU with half the compute power but double the memory bandwidth will generate responses twice as fast — with equal simultaneous requests. Increasing the number of parallel requests changes the equation.

1.3 Anatomy of a Live Forward Pass: Weights, KV Cache, and Batching

The 4.5% calculated above was the best case: first token, a single request in progress (batch=1), empty KV cache. In production both assumptions fall simultaneously: the cache grows with every token, and multiple requests are served in parallel. Let’s see what happens to the memory channel.

The two pieces of luggage on the memory channel:

Parameters (The Fixed Weight): 70GB that cross the channel at every decode cycle. Here parallelism is an advantage: with 8 simultaneous requests (batch=8), these 70GB are read only once to serve all 8 — the fixed cost is amortized across the entire group.

The KV Cache (The Dynamic Weight): Did not exist at token 1. From token 2 onward it grows at each step and is individual per request — it is not shared and cannot be amortized. In fact, the number of simultaneous requests directly multiplies it: 8 requests mean 8 separate caches.

Cache explosion during generation (70B Model)

Cache per token = 2 × num_layers × hidden_dim × bytes_per_param
2 × 80 × 8192 × 2 bytes = 2.6 MB per token per request

Forward pass comparison varying context and simultaneous requests:

Token 1, batch=1 (empty cache):
  Weights:   70.0 GB
  KV cache:   ~0.0 GB
  ─────────────────────────────────
  Total:     70.0 GB ÷ 3350 GB/s ≈ 21 ms → 4.5% utilization

Token 1000, batch=1:
  Weights:   70.0 GB
  KV cache:   2.6 GB  (1 request × 2.6 MB × 1000 tokens)
  ─────────────────────────────────
  Total:     72.6 GB ÷ 3350 GB/s ≈ 21.7 ms → 4.4% utilization

Token 4000, batch=1:
  Weights:   70.0 GB
  KV cache:  10.4 GB  (1 request × 2.6 MB × 4000 tokens)
  ─────────────────────────────────
  Total:     80.4 GB ÷ 3350 GB/s ≈ 24.0 ms → 4.0% utilization

Token 4000, batch=10:
  Weights:   70.0 GB  (read once for all 10 requests)
  KV cache: 104.0 GB  (10 requests × 2.6 MB × 4000 tokens)
  ─────────────────────────────────
  Total:    174.0 GB ÷ 3350 GB/s ≈ 52 ms → 1.9% utilization

Parallelism amortizes weights but multiplies the cache. With long contexts and many simultaneous requests, the KV cache surpasses the model weight itself and becomes the dominant factor in bandwidth consumption — and the reason why cache management is one of the central problems of modern inference engineering.

Scenario: 50 users @ 4000 tokens each

Cache per user: 10.4 GB
Total system cache: 50 × 10.4 GB = 520 GB (total VRAM occupied)

Read breakdown per single forward pass (batch size = 1): 70 GB weights + 10.4 GB user cache = ~80 GB per step

Read breakdown per single forward pass (batch size = 50): 70 GB weights + 520 GB cache = ~590 GB per step — impractical on a single GPU.

In production with continuous batching, vLLM typically groups moderate-sized batches. Weights are amortized across the batch, but each user’s KV cache is read separately and accumulates. The result is that with long contexts, cache dominates the bandwidth load even at the level of a single step.

Implication: The real constraint is not aggregate throughput, but available VRAM. 520 GB of total cache requires a multi-GPU cluster. On a single GPU, the number of users serviceable in parallel is directly limited by residual VRAM after loading the weights.

1.4 Batching: Amortizing the Fixed Cost

Sequences and Batches: Operational Definitions

A sequence is the unit of work of the inference engine: it represents a single request in progress, with its prompt, already-generated tokens, and associated KV cache. From the engine’s perspective, each sequence is an independent state that evolves token by token.

A batch is the set of sequences that the engine processes simultaneously in a single forward pass. The batch size is the number of sequences in that group.

The distinction is important: a request is the application-level concept — what arrives from the client. A sequence is its internal representation within the engine, with all the memory state that accompanies it. In performance and scheduling contexts, we always talk about sequences.

Request (application level)
    └── Sequence (engine level: prompt + generated tokens + KV cache)
            └── Batch (N sequences processed in a single forward pass)

In the previous section we saw that at batch=1 — a single sequence in progress — the model weights are read entirely at each generated token, regardless of how many requests are waiting in queue. It is a structural waste: the memory channel transports 70GB to produce a single token, for a single user.

Batching resolves this problem by aggregating multiple sequences into a single forward pass. The 70GB of weights are read only once, but the result serves N sequences simultaneously. The fixed cost divides by N.

Weight cost per sequence = 70 GB ÷ N sequences in batch

batch=1:   70.0 GB per sequence
batch=8:    8.75 GB per sequence
batch=32:   2.19 GB per sequence

This is why throughput — tokens generated per second across the entire system — grows with batch size. Not because the GPU becomes faster, but because its fixed cost is spread across more useful work.

The KV Cache Does Not Amortize

Weights are shared among all sequences in the batch. The KV cache is not — it is individual by definition. Each sequence carries its own cache, proportional to the length of its context, and that cache must reside in VRAM for the entire duration of the sequence.

The result is that the two components of the memory load scale in opposite directions as the batch increases:

batch=10, token 4000:
  Weights:   70.0 GB  (fixed, independent of batch)
  KV cache: 104.0 GB  (10 sequences × 2.6 MB × 4000 tokens)
  ──────────────────────────────────────────────────────
  Total:    174.0 GB

batch=20, token 4000:
  Weights:   70.0 GB  (unchanged)
  KV cache: 208.0 GB  (20 sequences × 2.6 MB × 4000 tokens)
  ──────────────────────────────────────────────────────
  Total:    278.0 GB

Doubling the batch doubles the KV cache. Weights remain identical. Beyond a certain threshold, available VRAM becomes the constraint that limits maximum batch size — not compute power.

Batch Size Trade-offs

Latency per sequence: Each forward pass processes N sequences simultaneously, which means each individual sequence waits for the entire batch to complete the step before receiving its token. Increasing batch size increases global throughput but lengthens the time between one token and the next for each individual user.

VRAM saturation: With N sequences active simultaneously, their N caches must coexist in VRAM at the same instant. Space cannot be freed until a sequence terminates. The maximum sustainable batch size depends directly on VRAM available after loading the model weights.

Diminishing returns: The throughput gain per additional sequence is not linear. In the first steps — from batch=1 to batch=8, for example — the gain is substantial because you are amortizing a high fixed cost over a small denominator. Beyond a certain threshold the engine encounters other bottlenecks and the marginal gain progressively decreases.

Marginal throughput gain (indicative):

batch=1  → batch=2:   +80-90%
batch=2  → batch=4:   +40-50%
batch=4  → batch=8:   +20-30%
batch=8  → batch=16:  +10-15%
batch=16 → batch=32:  +5-10%

Sequence Length and Batch Size: The Combined Constraint

Batch size and sequence length are not independent — they compete for the same VRAM. The relationship is direct:

VRAM available for KV cache = Total VRAM - model weights

Maximum sustainable sequences =
    VRAM available for KV cache ÷ (2.6 MB × average sequence length)

With short sequences (quick chats, contexts under 1000 tokens) you can sustain a high batch size. With long sequences (extended contexts, documents, multi-step reasoning) the maximum batch size collapses — not because the engine is misconfigured, but because the VRAM math leaves no room.

This is the fundamental constraint that every production configuration must balance: high batch size requires short sequences, or more VRAM.

In Summary

The optimal batch size is not the maximum technically sustainable — it is the point where throughput and latency balance for the specific use case. A consumer chat system requires low per-token latency, which pushes toward moderate batch sizes. An overnight batch processing job can tolerate high per-sequence latency in exchange for maximum throughput — same hardware, opposite configuration.

recap

Batching is useful because in the forward pass there exists work common to all sequences: the use of model weights. Processing more sequences together increases GPU efficiency and overall throughput, also because the GPU is designed to execute many operations in parallel, thus better utilizing its compute units. Basically, need more throughput? Increase batching — need more latency? Decrease it.

The chapter dedicated to vLLM will show how these parameters translate into concrete configuration and how the engine dynamically manages sequence admission to respect VRAM constraints in real time.

Toward Part 2

In Chapter 1 we quantified the two fundamental problems of inference: bandwidth is the real bottleneck during decode, and the KV Cache grows linearly and inevitably until it dominates the load on memory. With long conversations and many simultaneous users, VRAM quickly becomes the system’s limit — not because of weights, but because of cache.

In Part 2 we will see how PagedAttention tackles this problem at the root, managing the KV Cache dynamically instead of allocating it statically — allowing many more users to be served with the same VRAM.

Chapter 2: Why a Model Costs Resources

In Chapter 1 we saw that decode is limited by memory read speed, and that the memory channel carries two distinct pieces of luggage: model weights and the KV cache. This part analyzes the second piece in detail — the one that grows, that is not shared, and that determines the real number of users you can serve. Understanding how the KV cache works, how much it costs, and how to manage it is the central problem of production inference engineering.

2.1 The Model’s Short-Term Memory

During generation, the model does not work in a vacuum: each token produced depends on all previous ones. To avoid recalculating the entire context for each new token, the system uses the KV Cache.

The Model: transforms each token into two mathematical vectors (Key and Value) that represent that word’s contribution to the sequence context.
The Serving System: stores these vectors in GPU VRAM.
The Inference Engineer: must manage the space occupied by these vectors. Each added token is not just a mathematical operation — it is a cost in bytes that accumulates for the entire duration of the sequence.

As seen in Ch 1, the cache is born during prefill — when the model processes the initial prompt and calculates K and V vectors for each context token, saving them in VRAM. During decode, the cache is only read: for each new generated token, the model consults all already-saved vectors without recalculating them. This is exactly the computational saving — but the memory cost remains and grows with each produced token.

2.2 The Economic Account: How Much Does a Token Cost?

The KV cache size is not fixed: it grows linearly during generation. For a model like Llama 3 70B, the calculation is:

Bytes per token = 2 × num_layers × hidden_dim × bytes_per_param

2: K and V vectors
80 layers: the model’s “floors”
8192: the vector dimension (Hidden Dimension)
2 bytes: weight of each number in FP16

2 × 80 × 8192 × 2 = 2,621,440 bytes ≈ 2.6 MB per token

This explains the numbers seen in Ch 1.3: the cache starts from nearly zero and reaches 10.4 GB for a 4000-token conversation. With 50 simultaneous sequences at that context level, cache alone occupies 520 GB — dwarfing the model’s 70 GB of weights. Batch size, as seen in Ch 1.4, does not amortize this cost: each sequence carries its own cache, and all must coexist in VRAM simultaneously.

Technical Box: The GQA Trick

Many modern models use Grouped Query Attention. Instead of saving cache for each attention head, keys are shared between groups — drastically reducing the cost per token.

Llama 3 70B uses 8 KV heads across 64 attention heads: the reduction factor is exactly 8×, bringing the cost per token from ~2.6 MB to ~0.32 MB. The factor depends on the specific architecture — 8× is the value for Llama 3 70B, other models have different values.

The growth dynamic remains identical: the cache will still end up dominating VRAM in multi-sequence scenarios, it will simply take more context to get there.

KV cache optimizations divide into three structural categories, each with a different objective:

Space management (PagedAttention): eliminate VRAM waste through dynamic allocation
Context sharing (Prefix Sharing): avoid duplicating identical caches across different sequences
Weight reduction (Quantization): decrease bytes per token to lighten the bandwidth load

The first two increase capacity — how many sequences you can serve simultaneously. The third increases bandwidth efficiency — how quickly you respond to each. Ch 2.6 shows why these two levers do not always move in the same direction.

2.3 Fragmentation and the PagedAttention Solution

Before innovations like PagedAttention, the cache was managed as contiguous blocks. If you predicted a 4000-token conversation, the system would immediately seize 10.4 GB of VRAM — regardless of how much context the sequence would actually use.

This created two distinct problems:

Internal fragmentation: if you allocate space for 4000 tokens and the sequence uses only 200, 95% of the memory is occupied but empty. Like booking a table for 20 people for a dinner of two.

External fragmentation: if you have 20 GB free but distributed in scattered blocks among various active sequences, you cannot admit a new sequence that requires 10 GB contiguous — even though the total space would suffice.

PagedAttention solves both problems by applying the concept of virtual memory from operating systems:

The cache is broken into small, fixed pages (e.g., 16 tokens per page)
Physical pages can be scattered anywhere in VRAM — they need not be contiguous
A Page Table per sequence keeps track of where its conversation fragments are

Result: VRAM utilization goes from ~30% to ~90%. The same GPU can serve 2-3 times more simultaneous sequences because space is allocated a piece at a time, only when needed. This is also what makes it possible to increase sustainable batch size without wasting VRAM on contexts never reached.

PagedAttention enables a further optimization: Prefix Sharing.

With traditional contiguous allocation, each sequence has physically separate memory — impossible to share blocks between different sequences. With pages, you just point the Page Tables of different sequences to the same physical pages for the common prefix.

Typical use case: a chatbot with a fixed 500-token system prompt shared by 100 simultaneous sequences.

Without Prefix Sharing: those 500 tokens are calculated and stored 100 times — one copy per sequence.

With Prefix Sharing: they are calculated and stored once. All 100 sequences read the same physical pages until they start generating their individual responses — from that point their caches diverge and become independent.

2.5 Cache Quantization (FP8 / INT8)

The third lever is reducing the weight of each token in cache by going from FP16 to INT8 or FP8. The trade-off is much more favorable compared to model weight quantization:

Characteristic	Weight Quantization (INT4)	KV Cache Quantization (FP8/INT8)
Memory Savings	4× (Static)	2× (Dynamic)
Degradation	5-10% (Permanent)	<1% (Temporary per session)
Bandwidth Advantage	Speeds up weight loading	Reduces data read in decode

Why is the degradation so low? KV cache values are intermediate computation results — not the model’s deep weights. A small loss of precision on these temporary values has negligible impact on final response quality.

Rule of thumb: if you are short on VRAM, quantizing the cache is the first move. It does not touch the model’s intelligence — it only reduces the precision of temporary data that exists only for the duration of the sequence.

2.6 The Trade-off: Capacity vs Bandwidth

The three structural optimizations act on two distinct levers that do not always move in the same direction:

Capacity (how many sequences fit): solved by PagedAttention and Prefix Sharing — less space occupied per sequence, more sequences in the same VRAM.

Bandwidth efficiency (how quickly you respond): solved by quantization — less data to read per forward pass, more tokens per second.

Example of conflict — Prefix Sharing:

On the capacity front: 100 sequences share the same system prompt. Instead of 100 copies in cache, only one exists — 99% savings for that block.

On the bandwidth front: no savings. To generate the response for each sequence, the GPU must still read those data from VRAM for each forward pass. Sharing eliminates duplication in memory, but does not reduce reads during computation.

In summary: PagedAttention and Prefix Sharing let you fit more sequences into the GPU. Quantization lets you respond faster to those sequences. They are complementary — in production they are used together.

Box: Advanced KV Cache Management Techniques

The three structural optimizations above are implemented by default in modern inference engines. Additional, more situational techniques exist that intervene when VRAM or latency constraints become extreme:

KV Cache Offloading (CPU/NVMe) When VRAM is saturated, the cache of less active sequences is moved to system RAM or NVMe storage, and reloaded when that sequence becomes active again. The trade-off is latency: CPU→GPU bandwidth (typically 32-64 GB/s on PCIe) is two orders of magnitude lower than internal VRAM bandwidth. Useful for sequences with very long contexts and low access frequency. This is one of the levers we directly measured at Dielabs with LMCache on Qwen3-8B-AWQ, with results showing concrete benefits under real memory pressure — the chapter dedicated to vLLM elaborates on the data.

Sliding Window Attention Instead of maintaining cache for all context tokens, the model keeps only a window of the N most recent tokens. Tokens outside the window are discarded. VRAM savings are linear with window reduction, but the model loses access to remote context — an architectural choice that must be made during training, not deployment. Used in Mistral 7B v0.1.

StreamingLLM A variant of sliding window that also keeps the very first tokens of the context (so-called “attention sinks”) in addition to the recent sliding window. Empirical observation: initial tokens receive disproportionate attention regardless of content, and removing them destabilizes the model. StreamingLLM preserves them, achieving stability with arbitrarily long contexts at nearly constant VRAM cost.

Token Eviction / Selective Importance Instead of discarding tokens by position (sliding window), those that received the least attention in past iterations are discarded — assuming they are less relevant for future tokens. Techniques like H2O (Heavy Hitter Oracle) dynamically select which tokens to keep. The risk is losing semantically important context that was temporarily ignored.

These advanced techniques share a common trade-off: they reduce cache cost at the expense of quality, latency, or architectural flexibility. The choice depends on the specific use case and hardware constraints.

Chapter 3 Ch 3: Where the Model Lives

Layer separation is a diagnostic protocol. The value is not in knowing which layer a component belongs to, but in knowing which layer has the authority to modify the behavior that generates the symptom.

A problem is solved in the correct layer when the fix does not require changes to adjacent layers and does not create regressions in the layers above. If you are solving an OOM by changing the gateway configuration instead of acting on the backend, you are working in the wrong layer — you are masking the symptom, not removing the cause.

L0 - The Silicon:

Hardware choice is based not only on TFLOPS, but on the balance between Capacity (how much VRAM you have) and Bandwidth (how fast the pipe is).

NVIDIA H100: 80GB VRAM, 3.35 TB/s Bandwidth. It is the Ferrari: generates tokens quickly but has limited capacity for giant models.
DGX Spark / Unified Systems: 256GB VRAM, 273 GB/s Bandwidth. It is the truck: can load enormous models (high capacity) but generates tokens much more slowly.

Interconnects: Transform isolated GPUs into a coherent system.

NVLink (900 GB/s): The “private tunnel” intra-node. The only one fast enough for Tensor Parallelism (splitting a computation across GPUs in the same server) thanks to ultra-low latency (~1-2 μs). About 15-20× faster than PCIe.
InfiniBand (~200 GB/s): A dedicated network protocol and fabric, separate from Ethernet. Very low latency (~1 μs), high performance, but requires expensive proprietary infrastructure. Used in high-end HPC and AI clusters.
RoCE — RDMA over Converged Ethernet (25-100 GB/s): RDMA implemented over standard Ethernet. Same mechanism as InfiniBand — direct transfer between memories of different servers without involving the CPU — but on already-existing Ethernet infrastructure. Slightly higher latency than InfiniBand, significantly lower cost. In modern AI datacenters it is the prevalent choice for inter-node connectivity.
PCIe (32-64 GB/s): The system bus — connects GPU to CPU and system RAM. It is the “default” channel present in any x86 server. Used to transfer data from disk or network to GPU, but too slow to coordinate distributed computations between GPUs. If two GPUs communicate only via PCIe, the bottleneck is immediate.

L1 - Drivers

Layer 1 is the software layer between the inference framework and the silicon. Its job is twofold: translate mathematical operations into executable hardware instructions, and coordinate communication when the workload is distributed across multiple GPUs.

Without this layer, vLLM or TensorRT-LLM would not know how to talk to the GPU — nor how to synchronize two GPUs processing parts of the same model. The most important technologies residing in this layer are:

Driver (NVIDIA): the foundation of the entire stack. The driver version installed on the host defines which CUDA version is supportable. A container requiring CUDA 12.x will not start on a host with driver 515.x — the minimum required version is 525.x. Everything else depends on this layer.

CUDA Runtime & API: translates the mathematical operations of frameworks (PyTorch, vLLM) into instructions executable by GPU cores. Every inference framework depends on a specific CUDA version to access hardware features.

NCCL (NVIDIA Collective Communications Library): manages synchronization and data transfer between multiple GPUs — both intra-node via NVLink and inter-node via InfiniBand/RoCE.

L2 – Runtime & Packaging

L2 is the layer that isolates and distributes the execution environment. In practice, it is the world of containers.

The components are containerd or Docker as the container runtime, nvidia-container-runtime as the bridge between the container and the host’s GPU driver, CUDA-based images as the base for the entire environment, and compiled libraries — PyTorch, TensorRT-LLM, vLLM — distributed as part of the image.

Decisions made at L2 are binding for all upper layers: which CUDA version to use, which libraries to link, which backend version to deploy. This layer is also the source of one of the most frequent problems in multi-node environments: image drift. The model works on node A and crashes on node B for no apparent reason. The cause is almost always a different image, a different library, or a different runtime version between nodes. L2 not managed with rigor becomes an infinite source of irreproducibility.

L3 – Inference Backend (Execution Engine)

L3 is the computational heart. This is where the model actually executes the forward pass.

Practical examples are vLLM, TensorRT-LLM, llama.cpp, and TGI. This layer’s responsibilities include loading weights into VRAM, managing the KV cache, GPU memory allocation, executing CUDA kernels, and continuous batching if integrated into the backend.

An important detail for understanding L3: this layer does not know about users. It does not know there are ten connected clients or ten thousand. It works on tensors and batches. The abstraction toward the outside world happens in the upper layers.

Typical L3 problems are OOM from KV cache when context grows beyond expectations, memory fragmentation that degrades performance over time, inefficient batching that leaves the GPU underutilized, and kernels not optimized for the specific hardware or numerical precision.

L4 – Serving & Orchestration

L3 decides how to compute.

L4a decides where and when to route the request.

This layer’s components are request routers, inference-aware load balancers, and serving-level schedulers (e.g., vLLM’s internal router in multi-instance, Triton model ensemble routing, KServe/Seldon request routing). Functions include dispatching requests to the correct instance, load balancing based on inference-specific metrics (queue depth, KV cache pressure, TTFT), and priority management among concurrent requests.

The distinction with L3 is critical. A perfectly optimized L3 backend can appear unusable if L4a routes poorly. The typical problem is GPU idle with a full queue: it is not a backend problem, it is a routing policy problem. The inference arrived, the resource is available, but the router is not assigning it correctly. Debugging here does not go through nvtop — it goes through router logs and queue depth metrics per instance.

L4b – GPU Workload Optimization

L4a decides where to send the request. L4b decides how to optimize the workload on the GPU once it has arrived.

This layer’s components are dynamic batching mechanisms (continuous batching, chunked prefill), intra-GPU scheduling (iteration-level scheduling, preemption), and KV cache management (paged attention, offload, prefix caching). Functions include maximizing per-GPU throughput while keeping latency below SLOs, managing GPU memory among concurrent requests, and optimizing the prefill/decode ratio.

The distinction with L4a is the level of granularity: L4a reasons at the request and instance level, L4b reasons at the batch and iteration level on the individual GPU. A perfect L4a that sends requests to the right instance is useless if L4b does not do continuous batching and leaves the GPU waiting between requests. Symmetrically, an optimized L4b with PagedAttention and chunked prefill is nullified if L4a overloads one instance and leaves another empty.

L5 – API Gateway / Edge

L5 is the layer that protects the backend from everything coming from the outside.

Functions include rate limiting, authentication, request validation, context cap enforcement, and multi-tenant isolation. It is a layer often underestimated during design and overestimated during debugging.

The consequence of a misconfigured L5 is direct and brutal: a single tenant can saturate the backend’s VRAM by sending excessively long prompts, or can consume all capacity with unlimited requests, degrading service for everyone else. Context cap enforcement is not a luxury function — it is the primary defense against externally induced OOM.

L6 – Client

L6 is the origin of real load. It is the layer that generates the requests the entire stack must serve.

Critical choices at the client level are session management (stateless vs stateful), context length sent, streaming vs blocking usage, and retry policy on error.

The relevant operational fact: a stateless client multiplies prefill cost. If every request sends the entire conversation history instead of maintaining state, the backend must recalculate prefill from scratch each time. The result is high TTFT that looks like a GPU problem but is an application design problem. The fix is not in the backend — it is in the client.

Appendix A on troubleshooting will present a further deep dive into the layers, dedicated exclusively to identifying and resolving the specific problems that may manifest in the different layers.

Toward Part 4

The stack defines where code runs. Part 4 addresses how to distribute a model too large for a single GPU: we will explore Tensor Parallelism (splitting the matrices of each layer across multiple GPUs) and Pipeline Parallelism (splitting layers sequentially across multiple GPUs) to scale beyond the physical limits of a single chip.

Chapter 4 Ch 4: How to Optimize the Model

Part 3 showed where responsibilities live. This part analyzes the specific techniques for optimizing throughput and latency, manipulating the physical constraints (Memory Wall) and data structures (KV Cache) analyzed in previous chapters.

4.1 Batching: The Efficiency of Shared Travel

The LLM problem is that the GPU spends 95% of its time reading parameters from memory. Batching is the only weapon to amortize this cost.

The Batching Math (70B INT8 Model):

Single request: Read 70GB → generate 1 token → time: 21ms.
Batch of 4 requests: Read 70GB once → generate 4 tokens → time: 21ms.
Result: Throughput goes from 1 to 4 tokens/21ms. A 4× speedup at equal bandwidth.

The Evolution: From Static to Continuous

Static Batching: The system waits until 8 people arrive before starting the bus. If one user writes a novel and the others write one word, the seats remain occupied and unused until the last person finishes.
Continuous Batching: As soon as a user finishes, they get off the bus. At the next stop, a new user immediately boards from the queue. The GPU always travels at full capacity.

Chunked Prefill: The “Stutter” Problem

Boarding a user with a long prompt (Prefill) “clogs” the system, creating jitter for those already in generation phase (Decode).

Numerical example: An atomic prefill of 1000 tokens blocks the system for ~50ms (3 lost decode cycles). With Chunking, we divide it into 4 pieces of 250 tokens (~12ms each), interleaving them with other users’ generation. Result: zero “stutters.”

Layer Responsibility: L3 (Inference Backend)

Config Parameter (vLLM): max_num_seqs (maximum batch size), enable_chunked_prefill=True

See: Part 3.3 (vLLM Architecture).

4.2 Parallelism: Splitting the Model Across GPUs

When a model does not fit in a single GPU, we must distribute it. Speed depends on the “nerves” between chips.

Intra-Node: Tensor Parallelism (TP)

We split the matrices of each layer across GPUs in the same server.

Dynamic: Requires continuous synchronization via NVLink (900 GB/s) — L0 component.
Overhead: 80 layers × 2μs latency = 160μs (~1.6% of total time). Almost invisible.

Inter-Node: Pipeline Parallelism (PP)

We split the model sequentially across different servers (e.g., GPU 1: layers 1-40; GPU 2: layers 41-80).

Dynamic: Handoff via InfiniBand/RoCE (25-50 GB/s) — inter-node L0 fabric.
RDMA (Remote Direct Memory Access): Fundamental in this phase. It allows Node A’s GPU to write data directly into Node B’s GPU VRAM, bypassing CPU and OS. Without RDMA the overhead would be 20μs; with RDMA it drops to 1-5μs.

Layer Responsibility: L0 (NVLink/InfiniBand fabric) + L3 (Inference Backend — manages layer distribution and synchronization)

Config Parameter (vLLM): tensor_parallel_size=N, pipeline_parallel_size=N

See: Part 3.3 (vLLM Architecture).

4.3 Prefix Caching: Don’t Repeat Yourself

In RAG or Enterprise Chatbot systems, many requests share the same System Prompt.

Without Caching: Each request recalculates the prefill for common tokens.
With Caching: The system saves the prefix’s KV Cache in VRAM. If a request with an identical prefix arrives, it loads instantly.
Impact: Time to First Token (TTFT) drops drastically (up to 80%).

Layer Responsibility: L3 (Inference Backend)

Config Parameter (vLLM): enable_prefix_caching=True

See: Part 2.4 (Prefix Sharing) and Part 3.3 (vLLM Backend).

4.4 Speculative Decoding: The Fortune Teller and the Master

Decode is sequential. This technique breaks the limit using a “Fortune Teller.”

Draft Model (The Fortune Teller): Small (1B-7B), ultra-fast, imprecise. It bets on the next 5 words.
Target Model (The Master): Large (70B), slow, infallible. It checks all 5 words in one shot.

VRAM Requirements: Loading a 70B (70GB) and a 7B (7GB) on an 80GB GPU is risky. Often you need to quantize the target to INT4 (35GB) to leave room for the KV cache.

Layer Responsibility: L3 (Inference Backend — both models live in the backend, verification is internal to the forward pass)

Config Parameter (vLLM): speculative_model="draft_model_name", num_speculative_tokens=5

See: Part 3.3 (vLLM Backend).

4.5 Decision Making: Final Cheat Sheet

If the problem is…	Technique to use	Layer	Config Parameter
Full queue, underutilized GPU	Continuous Batching	L4b	`max_num_seqs` ↑
High TTFT under load (prefill blocks decode)	Chunked Prefill	L4b	`enable_chunked_prefill=True`
High TTFT (repeated prefixes)	Prefix Caching	L4b	`enable_prefix_caching=True`
OOM on long sequences / many concurrent requests	Paged Attention + KV Offload	L4b	`gpu_memory_utilization`, LMCache config
Unpredictable latency, random spikes	Preemption / iteration-level scheduling	L4b	Scheduler policy config
Low TPS, slow decode	Speculative Decoding	L3	`speculative_model` config
Model too large for one GPU	Tensor Parallelism	L0 + L3	`tensor_parallel_size=N`
Model distributed multi-node	Pipeline Parallelism	L0 + L3	`pipeline_parallel_size=N`
Low throughput on small model, single GPU	Data Parallelism	L3 + L4a	`data_parallel_size=N` + router
One instance saturated, others idle	Request Load Balancing	L4a	Router policy / queue-aware LB
Request drops under traffic bursts	Admission Control / Queuing	L4a	Queue depth limits, timeout config

Toward Part 5

We have the tools. In Part 5 we will see how to combine them into real Deployment Topologies: from massive Cloud clusters to local workstations, analyzing costs, scalability, and operational risks.

Chapter 5 Ch 5: Architecture and Economics of a Model

Parts 1-4 built the engines. Part 5 is the vehicle choice: we analyze how to combine hardware, network, and software into real architectures, balancing performance and budget.

5.1 Deployment Topologies: Choose Your Constraint

In inference, you cannot have everything. You must choose which “bottleneck” to accept in exchange for other advantages.

A. Datacenter Scale-Out (The Throughput Ferrari)

Architecture: Cluster of nodes (8x GPU) connected via NVLink internally and InfiniBand externally.
Strength: Generation speed (TPS) and massive scalability.
Constraint: Extreme complexity. Requires a dedicated SRE team to manage networking.

B. Unified Memory Workstation (The Capacity “Truck”)

Architecture: CPU and GPU share the same RAM.
Hardware examples:
- Apple M3 Max: 128GB unified, ~400 GB/s bandwidth.
- NVIDIA Grace Hopper: 512GB unified, 450 GB/s bandwidth.
- (For comparison: an H100 has 80GB HBM3 at 3350 GB/s).
Strength: Enormous capacity (175B+ models) at a fraction of the cost.
Constraint: Bandwidth. Unified memory is 7-8× slower than HBM. If an H100 generates 50 tokens/s, here you will generate 6-7.

C. Cloud API (The Public Bus)

Architecture: Managed services (OpenAI, AWS Bedrock). Pay-per-use.
Constraint: Noisy Neighbor Effect. Your workload shares nodes with others. If another user “pushes” hard, your lane gets occupied by a truck.
- Impact: Under congestion, TTFT can go from 300ms to 1200ms (4× degradation).

5.2 Network Topology: The Cost of Distance

Execution speed is often limited by cables, not chips.

NVLink (L0 - Internal): 900 GB/s. Here Tensor Parallelism (TP) is “free” (1.5% overhead). GPUs act as a single brain.
InfiniBand (L0 - External): Specified at 400 Gb/s (NDR), effective bidirectional speed 25-50 GB/s. TP becomes expensive. The Hybrid approach is used: TP inside the server, Pipeline Parallelism (PP) between servers via RDMA.
Ethernet (L0 - Standard): 100 GbE (~12.5 GB/s). Latency 10-50μs. TP is unusable (>25% overhead). Only PP with micro-batching is sustainable.

Layer Responsibility: L0 (Hardware — intra-node NVLink and inter-node InfiniBand/Ethernet fabric) + L3 (Inference Backend — manages TP and PP on top of that fabric)

Config Parameter (vLLM): tensor_parallel_size=N, pipeline_parallel_size=N

Regarding PP: the real gain is only being able to load a model that does not fit in a single node. It has mostly downsides: the slowest node blocks everything, if a node goes down everything stops, balancing is non-trivial, and the inter-node link is critical in case of multiple batches (bubble).

5.3 System Metrics: Beyond Average Speed

Don’t be fooled by the average (p50). In production, a system is only as good as its p99.

TTFT (Responsiveness): Time to first token. Target: < 500ms (p95).
TPOT / ITL (Fluidity): Time between tokens. Target: < 50ms (p95). Unstable ITL causes the “stuttering text” effect.
Goodput: The truth metric. It is the number of generated tokens that meet the SLOs (Service Level Objectives).

Efficiency = Goodput / Throughput. Increasing batch size increases throughput, but if it destroys latency, your Goodput collapses.

5.4 Cost Modeling: The Economics of the Token

Optimization is a direct financial saving. Let’s analyze an 8x H100 server:

CAPEX: $200,000 amortized over 3 years → $182/day.
OPEX (Power & Cooling): 5kW IT load with a PUE (Power Usage Effectiveness) of 1.4 → 7kW total. At $0.12/kWh → $20/day.
Total Daily Cost: $202/day.

Inference Engineering Impact:

Deployment A (Unoptimized): 300 token/s goodput → $7.80 per 1M tokens.
Deployment B (vLLM + Caching + Quant): 1000 token/s goodput → $2.34 per 1M tokens.

5.6 Decision Tree: Which Architecture for You?

START: Identify your primary constraint

Volume < 10M tokens/month → Cloud API. Zero CAPEX, immediate time-to-market.
Volume 10-100M tokens/month:
- Latency critical + Technical team → Single-node self-hosted (8x H100).
- No SRE team → Hybrid (Base self-hosted + Cloud burst).
Volume > 100M tokens/month:
- Critical Privacy/Compliance → Private Datacenter (Multi-node InfiniBand).
- Maximum efficiency + SRE team → Large-scale cluster (TP + PP).

Conclusion of the Manual

Being an Inference Engineer means mediating between mathematics and physics. There is no perfect configuration, only efficiency relative to constraints.

Efficiency Beats Size:

Poorly configured 70B model: 200 token/s on 2x H100 → $8 per 1M tokens.
Optimized 7B model: 180 token/s on 1x RTX 4090 → $1.20 per 1M tokens.

A smaller, correctly optimized model beats a poorly managed giant model on every economic and operational metric.

Chapter 6 Epilogue: From Abstraction to Reality

If I had to condense this manual into operational principles for an Inference Engineer, they would be these:

1. In decode, bandwidth beats compute always

Don’t buy GPUs for TFLOPS. Buy for GB/s of memory bandwidth and GB of total VRAM. An H100 with 989 theoretical TFLOPS produces 30-50 effective TFLOPS in decode because it is limited by 3.35 TB/s bandwidth. A hypothetical GPU with 500 TFLOPS but 6 TB/s bandwidth would be faster for LLM workloads.

Practical implication: when comparing hardware, the relevant metric is not “how fast it computes” but “how fast it reads from memory.” Memory bandwidth determines tokens/second in decode, which is >90% of total inference time.

2. The KV cache is the hidden cost

When sizing infrastructure, don’t calculate only parameter size. Calculate:

Total memory = Parameters + (Concurrent users × Average context × Cache per token)

Where cache_per_token depends on architecture:
- Dense model 70B: 2.6 MB/token
- GQA model 70B: ~0.32 MB/token (8× reduction, see Ch 2.2 box)
- Formula: 2 × Layers × Hidden_dim × Precision

In multi-user production, cache exceeds parameters as total memory consumption. Optimizing it (quantization FP16→INT8, PagedAttention, prefix sharing) has higher ROI than optimizing weights because it impacts more simultaneously serviceable users and bandwidth consumption per request.

3. Batching is the only universal optimization

All other techniques have conditions:

Tensor Parallelism requires NVLink (intra-node only)
Speculative decoding requires predictable workloads (acceptance rate >60%)
Prefix caching requires repetitive prompts (common system prompts, RAG injection)

Continuous batching always works. It improves throughput by amortizing memory access across multiple users (reading 70GB of parameters once to generate 4 tokens instead of 1 = 4× speedup) without degrading individual latency (no head-of-line blocking thanks to chunked prefill). If you implement only one optimization, let it be this.

4. Measure percentiles, not averages

A system with p50 latency 200ms but p99 latency 5000ms is a failed system for production. Users live in the long tail of the distribution. If your SLO is “p95 TTFT <500ms,” the 95th slowest user defines the system, not the arithmetic mean.

Concrete example:

System A: mean 300ms, p50 250ms, p95 480ms, p99 3500ms
System B: mean 400ms, p50 380ms, p95 450ms, p99 600ms

System B is superior for production despite worse average.
Controlled p99 beats low mean with unstable tail.

Continuous batching, chunked prefill, queue management, and autoscaling exist to control percentiles, not to lower the average.

5. Cost per token is the only business metric

GPU costs $X/hour regardless of how many tokens you produce. Every idle second is wasted CAPEX amortization. Every token generated outside SLO is computational waste (high throughput, low goodput). Efficiency is not a technical virtue but an economic imperative.

Quantification (from Ch 5.4):

Unoptimized deployment:
- Throughput: 300 token/sec goodput
- Cost: $7.80 per 1M tokens
- Daily capacity: 25.9M tokens

Optimized deployment (vLLM + batching + quantization + caching):
- Throughput: 1000 token/sec goodput
- Cost: $2.34 per 1M tokens
- Daily capacity: 86.4M tokens

Same hardware, 70% saving per token, 3.3× more users served.

Every optimization discussed in Part 4 translates directly into cost per token reduction or capacity increase. The business case for inference engineering is mathematical, not speculative.

6. Complexity lives somewhere, it never disappears

Cloud abstraction hides complexity but you pay for it 10-100× in $/token (see Ch 5 decision tree: cloud $3-10 per 1M tokens vs self-hosted $1-2). Datacenter scale-out requires a dedicated SRE team to manage multi-node networking (InfiniBand, RDMA, orchestration). Workstation scale-up accepts limited throughput (bandwidth 7-8× slower than H100) for operational simplicity.

There is no “simple and performant and cheap” simultaneously. Choose which two you get:

Simple + Performant: cloud managed services, you pay premium cost
Performant + Cheap: datacenter self-hosted, you pay complexity ops
Simple + Cheap: workstation scale-up, you pay throughput limitation

The third attribute is always sacrificed. The art is choosing consciously which one.

What We Left Out

This manual focused on invariant architectural fundamentals, not on ephemeral practical implementation. We deliberately omitted:

Framework-specific implementation details: how to configure vLLM config.yaml, Kubernetes deployment YAML syntax, troubleshooting CUDA driver version mismatch. These change rapidly with every release; the principles (memory wall, cache economics, batching amortization) do not. Appendix A provides a troubleshooting guide but is a temporal snapshot, not a permanent reference.

Framework comparisons: detailed comparison of PyTorch vs TensorRT-LLM vs llama.cpp vs Triton. The choice of serving framework is a consequence of deployment architecture, not the other way around. If you have identified the bottleneck (e.g., saturated memory bandwidth), choose the framework with the most efficient kernels for that specific workload. The framework does not solve architectural problems.

Model training: how backpropagation works, dataset preparation, hyperparameter tuning, distributed training strategies (ZeRO, FSDP). The inference engineer receives frozen models from the training team; training is a separate domain with opposite constraints (you need to modify parameters, not serve them).

Application layer logic: prompt engineering techniques, RAG architecture patterns, agent frameworks (LangChain, LlamaIndex), semantic caching, query routing. These live above Layer 6 (client) in our stack. Infrastructure serves requests regardless of semantic content; semantic optimization is the application developer’s responsibility.

Unvalidated cutting-edge research: emerging techniques published in the last 3-6 months without verified production deployment. We favored battle-tested techniques at scale (continuous batching, PagedAttention, quantization aware training) over recent promising but unproven papers under adversarial production conditions.

Exotic hardware: deployment on TPU, Cerebras wafer-scale, Graphcore IPU, custom ASIC. We focused on the NVIDIA GPU ecosystem because it dominates >95% of production deployments. The principles (bandwidth limitation, cache management, parallelism strategies) transfer, but implementation specifics differ.

These omissions are intentional. The manual provides a durable mental model; implementation details evolve too rapidly for static documentation. When vLLM release 0.X becomes 0.X+1, features change but the memory wall remains a physical invariant.

Emerging Directions (Watch This Space)

Inference engineering evolves rapidly. Three trends with production traction deserve attention:

1. Disaggregated Serving

Physical separation between prefill clusters and decode clusters. Prefill cluster (compute-intensive, TP-heavy, 60-70% GPU utilization) optimized for matrix parallelism. Separate decode cluster (memory-intensive, batching-heavy, bandwidth-optimized) optimized for multi-user serving. KV cache transferred between clusters via high-bandwidth RDMA network.

Rationale: opposing hardware requirements. Prefill benefits from high TFLOPS (A100, H100). Decode benefits from high bandwidth and large capacity (potentially less expensive hardware with memory-optimized HBM). Disaggregation allows right-sizing hardware for workload phase-specific needs instead of a single compromise.

Status: production deployment in scale-out datacenter (>100 nodes). Complexity overhead justified only at massive scale where hardware specialization ROI exceeds orchestration cost.

2. Sub-4-bit Quantization with Mixed Precision

INT3, INT2, even binary weights for non-critical layers. Dynamic mixed precision: attention-heavy layers maintain FP8/INT8 for quality, feedforward layers tolerate INT2. Automatic calibration via runtime profiling instead of offline calibration datasets. Model adapts precision per layer based on activation statistics observed during serving.

Rationale: pushing the quality-efficiency trade-off frontier. INT4 is the current sweet spot (4× compression, 5-10% quality loss). INT2 could be the next frontier (8× compression) if quality degradation is controllable via selective per-layer precision.

Status: active research, some production deployments with INT3 on specific models. Not yet ubiquitous like INT4/INT8.

3. End-to-End Kernel Fusion

Compilers (Triton, XLA, TVM) fuse multiple operations into a single kernel launch. Instead of memory roundtrips after each layer (read input → compute → write output → read for next layer), the compute pipeline processes an entire block of layers in a single kernel with intermediate results in fast SRAM instead of slow HBM.

Rationale: dramatically reduces memory traffic. If each layer requires read+write from HBM, 80 layers = 160 memory operations. Kernel fusion reduces to read input → compute all 80 layers in SRAM → write final output = 2 memory operations. Theoretical bandwidth savings of 80×.

Status: production in framework-specific deployments (TensorRT-LLM aggressive fusion, vLLM via Triton kernels). Not yet universal across all serving frameworks.

Other active areas: inference-specific hardware (AWS Inferentia, Google TPU v5e, Groq LPU) sacrifices flexibility for fixed-function pipeline efficiency. Multi-model speculative decoding: draft chains (tiny → small → medium → large) with hierarchical verification. Alternative attention mechanisms (linear attention, state space models Mamba/RWKV) that eliminate quadratic KV cache complexity.

But fundamentals remain invariant: physical memory wall, cache economics (linear growth with context), batching amortization, cost per token as business metric. Techniques change, constraints do not. An inference engineer who masters the principles adapts rapidly to new techniques; one who memorizes current config YAML becomes obsolete with every framework release.

The Craft of the Inference Engineer

You have reached the end of this journey with a complete mental model: from the individual token to the distributed datacenter, from hardware physics to deployment topology, from physical constraints to operational optimizations, from architectural decisions to cost-per-token economics.

The Inference Engineer is a uniquely hybrid figure. You are not an ML researcher (you do not create models, you receive frozen checkpoints). You are not a traditional software engineer (you do not write application business logic). You are not a generic SRE (you manage workloads with physics completely different from stateless web services).

You are a specialized systems architect for workloads with unique characteristics:

Memory-bound instead of compute-bound (95% of time in memory access)
Stateful with linear growth (KV cache accumulates during conversation)
Latency-sensitive with high throughput requirements (p95 TTFT <500ms but serving 1000+ users)
Cost dominated by hardware CAPEX not labor (GPU depreciation, not eng salaries)
Utilization target 80-90% not 30-40% (hardware too expensive to idle)

Your value is not deploying the biggest or newest model. It is extracting maximum goodput from limited hardware, respecting latency SLOs, minimizing cost per token, maintaining operational simplicity where possible.

The Diagnostic Loop

When facing a production problem, apply this systematic workflow:

1. Identify the bottleneck: which resource is saturating?

Check metrics:
- GPU utilization (target: 60-90%)
- VRAM utilization (target: 80-90%)
- Memory bandwidth (target: 80-95%)
- Network bandwidth if multi-node (target: <70%)
- Queue depth (target: controlled, no monotonic growth)

2. Map to the stack layer: where does the problem live?

Use diagnostic table Ch 3.5:
- GPU idle + full queue → L4 (Scheduler)
- Immediate OOM → L3 (Backend memory)
- Progressive OOM → L5 (Gateway limits)
- High latency + active GPU → L0 (Hardware bandwidth)
- Jitter p99 spikes → L4 (Batching policy)

3. Choose appropriate technique: which optimization attacks that constraint?

Use cheat sheet Ch 4.5:
- Full queue → Continuous batching (max_num_seqs ↑)
- Model doesn't fit → Tensor Parallelism or Quantization
- High TTFT with repeated prefixes → Prefix caching
- Low throughput, latency OK → Speculative decoding
- Multi-server latency → Pipeline Parallelism

4. Measure impact: are the metrics improving?

Before-after comparison:
- Latency percentiles (p50, p95, p99)
- Throughput (tokens/sec goodput)
- Cost per token ($/1M)
- GPU utilization (target: increase if was <60%)
- VRAM utilization (target: controlled 80-90%, no OOM)

5. Iterate: has the bottleneck moved?

After fix, which resource now limits?
Typical example sequence:
Fix batching → GPU utilization 40% → 80%
Now VRAM saturates → quantize cache FP16→INT8
Now bandwidth saturates → quantize weights INT8→INT4
Now queue accumulates → scale horizontal, add nodes
Now network latency → optimize topology, InfiniBand

This loop never ends. Every optimization shifts the bottleneck to the next resource. The goal is not a “perfect system” (impossible, there is always a limiting resource) but a balanced system where all critical constraints saturate simultaneously at 80-90% utilization. GPU full, bandwidth high, VRAM controlled, queue stable, SLO respected.

When you reach this equilibrium, you have extracted maximum value from available hardware. The only way to increase capacity further is to add hardware (horizontal scale) or upgrade hardware (vertical scale). Software optimization exhausted.

Conclusion

Artificial intelligence has become a technological commodity: models are public (Llama, Mistral, Gemma), techniques are known (everyone reads the same papers), code is open source (vLLM, TensorRT-LLM, llama.cpp on GitHub). Access to the technology is no longer a competitive differentiator.

The competitive differentiator has shifted to how efficiently you serve the model.

Two companies serve the same 70B model. Same checkpoint weights, same output quality, same sampling parameters. One produces tokens at $8 per million. The other produces at $2 per million. The difference is not in the model (identical) but in the infrastructure:

Company A (naive deployment):
- Static batching (waits for complete batch before processing)
- Contiguous KV cache allocation (60% fragmentation)
- FP16 cache (2× necessary bandwidth)
- No prefix sharing (recalculates system prompt every request)
→ Throughput: 300 token/sec, cost $8/1M tokens

Company B (optimized deployment):
- Continuous batching (zero GPU idle time)
- PagedAttention (90% memory utilization)
- INT8 cache (halves bandwidth consumption)
- Prefix sharing (80% TTFT reduction on common prompts)
→ Throughput: 1000 token/sec, cost $2/1M tokens

Same model, 4× operational cost difference. With equal output quality, whoever has better infrastructure wins economically. Not because they serve more total requests (absolute throughput) but because they serve the same requests at lower marginal cost, enabling sustainable margins or competitive prices that undercut competition.

In a commodity market (same model for everyone), infrastructure efficiency becomes the only defensible moat. A competitor can copy your model choice (public), can copy your prompt engineering (reverse engineer via API), cannot easily copy your infrastructure optimization without equivalent expertise.

This manual has given you the theoretical foundations and operational framework. The rest is deliberate practice: deploy real systems, observe bottlenecks in production traffic, apply targeted optimizations, measure results quantitatively, iterate based on data not intuitions.

Every deployment is unique (different workload patterns, different hardware constraints, different budget constraints, different team capabilities), but the fundamental principles remain universal: physical memory wall, linear cache economics, batching amortization, layer-based diagnostics, cost per token as north star metric.

The maturity of an AI infrastructure is not measured by the deployed model (marketing: “powered by GPT-4!”) but by the silent metrics invisible to end users:

p99 latency under SLO every single day
GPU utilization >80% sustained not spikes
Cost per token in continuous decline quarter-over-quarter
Goodput efficiency >90% (generated tokens meet SLO)
Zero OOM crashes, zero capacity-related downtime

These metrics don’t appear in press releases but determine the long-term economic sustainability of the AI service.

Welcome to the craft of Inference Engineering.

The future of AI is not just better models (research labs produce breakthroughs), but existing models served better (infrastructure engineers extract maximum value). The two disciplines are complementary: researchers push the frontier of what is possible, infrastructure engineers make it economically sustainable at scale.

Your contribution as an Inference Engineer is not visible to the general public but is fundamental to AI democratization: transforming research prototypes (expensive, slow, fragile) into production services (affordable, fast, reliable) accessible to millions of users instead of a privileged few with unlimited budgets.

Every dollar saved in cost per token is a dollar that lowers the entry barrier for startups, expands economically viable use cases, and enables new applications previously unprofitable.

Efficient inference is not just technical optimization. It is an economic enabler that amplifies the impact of every AI research breakthrough.