When Offloading Doesn’t Offload

KV Cache Offloading in vLLM 0.15.1

Analysis, benchmarks and real limits — An experimental investigation into KV cache offloading mechanisms on NVIDIA RTX 4070 Super (12 GB VRAM, PCIe 4.0)


Abstract

This paper documents an experimental investigation into KV cache offloading mechanisms in vLLM 0.15.1, conducted on a consumer GPU node (NVIDIA RTX 4070 Super, 12 GB VRAM, PCIe 4.0 x16).

The investigation revealed three main results: (1) LMCache operates as an external prefix cache, not as a swap mechanism; (2) the Native OffloadingConnector (introduced in v0.11.0) implements a true preemption recovery offloading, but requires specific conditions to activate; (3) on consumer hardware with PCIe 4.0, the measurable benefit manifests on tail latency (ITL p99 reduced by 23%) but not on average throughput.

The paper describes the methodological path and the unexpected results that led to these conclusions. Formulas are included to support key concepts, but the text remains readable without them. Basic familiarity with inference engineering fundamentals is assumed.


1. Introduction

The KV cache is the most critical data structure in LLM inference. For each request, the model generates key-value pairs for every token processed, with linear growth relative to context length. On consumer GPUs with limited VRAM (12 GB), the KV cache becomes the primary bottleneck for concurrency.

The idea of “offloading” — moving part of the KV cache from GPU to CPU RAM — is conceptually simple and frequently promoted as a solution. However, the term conceals significant architectural complexity. This paper documents the attempt to activate and measure KV cache offloading in vLLM on real homelab hardware, revealing that the concept comprises at least three completely different mechanisms.


2. Background

2.1 KV Cache and Admission Control

For each token processed, the model generates a key-value pair. The KV cache size per request is:

kv_cache_per_request = num_tokens × num_layers × 2 × num_kv_heads × head_dim × bytes_per_element

For Qwen3-8B-AWQ (36 layers, 8 KV heads, head_dim 128, FP16), each token occupies approximately 144 KB. A request with 4096 input tokens generates ~576 MB of KV cache during prefill.

The VRAM available for the KV cache:

kv_cache_budget = (VRAM_total × gpu_memory_utilization) − model_weights − cuda_graphs − overhead

The V1 scheduler in vLLM calculates the maximum theoretical concurrency as kv_cache_budget divided by the worst-case allocation per request (max_model_len × kv_per_token). If there are not enough blocks, the new request enters the wait queue. This admission control is the primary gate governing the number of active requests — and the first thing to consider when attempting to force offloading. As shown in Section 5, the scheduler operates more optimistically than this formula suggests.

2.2 Preemption

Preemption intervenes when KV cache space becomes insufficient for all active requests. Understanding it is fundamental because preemption is the exact point where offloading can intervene: the cost of preemption is what offloading attempts to reduce.

Recompute (default V1). When a request is preempted, its memory blocks are freed and the request returns to the wait queue. When space becomes available, the request is rescheduled and its prefill recomputed from scratch. This design reflects V1’s optimization for reduced CPU overhead — synchronous GPU→CPU swap was removed to simplify the core system.

Preemption lifecycle: (1) Saturation — the scheduler detects insufficient free blocks for the next decode step. (2) Eviction — the scheduler selects requests to suspend (FCFS by default). (3) Block release — KV blocks return to the free queue; with prefix caching enabled, blocks remain identifiable via hash until overwritten. (4) Resumption — when memory frees up, the request resumes. If its blocks are still in cache (cache hit), part of the recompute is skipped; otherwise full prefill recompute occurs.

Performance impact. Preemption guarantees robustness (no OOM) but at a significant cost. It disproportionately affects p99 metrics: most requests complete normally, but preempted ones see latencies far above the average. During recompute, even non-preempted requests experience increased ITL because the GPU executes a compute-bound prefill job among their decode steps. This asymmetry is why offloading benefits manifest specifically on tail latency.

2.3 The Three Offloading Mechanisms

The investigation identified three distinct mechanisms, often confused under the generic term “KV cache offloading.”

A) LMCache — External Prefix Cache

LMCache operates as an external cache of pre-computed KV pairs, organized by token prefix hash. It is not integrated into vLLM’s scheduler.

B) Native OffloadingConnector — Preemption Recovery

Introduced in vLLM v0.11.0 (RFC #19854), the OffloadingConnector is a native component of vLLM’s KV Transfer framework. Its function is to save KV blocks to CPU RAM, enabling reload instead of recompute when a request is preempted.

C) FlexGen / InfiniGen / KVSwap — Full KV Swap

Research frameworks that implement full KV cache swap (and sometimes model weight swap) between GPU, CPU and disk. Designed for offline throughput scenarios on very large models. Not integrable as a vLLM serving backend. Outside the scope of this study.

Mechanism Comparison

Characteristic LMCache Native Offload FlexGen/InfiniGen
Type Prefix cache Preemption recovery Full KV swap
vLLM integration External plugin Native (v0.11.0+) Separate framework
Store trigger Every completed prefill Proactive (continuous mirroring) Always (by design)
Load trigger Prefix hash match Preempted request rescheduled Every attention step
Requires shared prefixes Yes (essential) No No
Requires preemption No Yes (essential) No
Modifies admission control No No N/A (own scheduler)
Optimal scenario RAG, multi-turn, system prompt High concurrency, bursts Offline batch, large models

3. Experimental Setup

Hardware and Software

Component Specification
GPU NVIDIA RTX 4070 Super (12 GB GDDR6X)
GPU interface PCIe 4.0 x16 (~21 GB/s bidirectional)
CPU RAM 32 GB DDR4
CPU Intel i7 (8 performance cores)
vLLM 0.15.1 (V1 architecture)
Model Qwen3-8B-AWQ (Marlin quantization)
Benchmark GuideLLM 0.5.3
Observability Prometheus + Grafana + DCGM Exporter

Workload

All tests use the same GuideLLM workload: concurrent mode, rate 10 req/s, 300 seconds per run. Prompt: 4096 input tokens + ≤512 output tokens (synthetic, random, no shared prefixes). The high input/low output ratio ensures deterministic pressure on the KV cache during prefill, while keeping requests active long enough to observe saturation during the decode phase.


4. Phase 1: LMCache — The Prefix Cache Misunderstanding

The initial hypothesis was that LMCache operated as an overflow mechanism: when the GPU runs out of KV space, blocks move to CPU RAM, allowing the scheduler to admit more requests.

Analysis of the official documentation revealed a completely different picture. LMCache does not modify the admission control logic. The scheduler continues to calculate maximum concurrency based exclusively on available GPU blocks. LMCache intervenes only after admission, to avoid re-prefilling tokens already computed when a subsequent request shares a prefix (hash match).

Since GuideLLM generates random prompts without shared prefixes, LMCache never finds a cache hit in this workload. This explains why results with LMCache enabled vs. disabled were virtually identical in previous tests (not published here).


5. Phase 2: Native OffloadingConnector

5.1 First Test — Conservative Configuration (Failed)

The first test used the same conservative configuration as the LMCache tests: gpu-memory-utilization=0.75, max-model-len=8192. Result: performance identical to vanilla. Host RAM rose from 5.5 GB to 13.6 GB (confirming that blocks were being reserved), but TTFT, throughput and latency remained unchanged.

5.2 The Scheduler Paradox

Analysis of the startup logs revealed the root cause. With the conservative configuration, vLLM reported a maximum theoretical concurrency of 1.36x for 8192-token requests. The scheduler admitted at most 2 simultaneous requests out of 10 concurrent submissions.

For the conservative configuration:

kv_cache = (12 GB × 0.75) − 5.7 GB (weights) − 0.6 GB (CUDA graph) − overhead ≈ 1.5 GB

max_concurrency ≈ 1.36x

The intuition “provide less VRAM for the KV cache to force the offload” produced the opposite effect: less VRAM → fewer blocks → fewer requests admitted → less contention → no preemption → offloading never activates. Static resource scarcity prevents the dynamic conflict needed to activate offloading.

5.3 The Solution: Dynamic Over-subscription

The correct strategy was the opposite: maximize admitted requests so that the KV cache saturates during generation, forcing preemptions. Three changes applied simultaneously:

5.4 The Optimistic Scheduler

A crucial result: the V1 scheduler admitted up to 7 simultaneous requests, exceeding the theoretical maximum by 4–5x. This was validated against vLLM’s technical documentation.

The max_concurrency reported at startup is the worst-case planning value — the scenario where every request simultaneously occupies max_model_len. The scheduler uses on-demand allocation: physical KV blocks are allocated only when needed, not pre-allocated for max_model_len at admission. The admission check verifies current actual occupancy:

admit_request = (blocks_needed_now ≤ blocks_free)

This enables over-subscription: requests arrive at different phases, and the scheduler sees available blocks based on each request’s current (not maximum) occupancy. When all requests generate tokens simultaneously, the KV cache grows in parallel and exceeds the physical budget — triggering preemption. This is exactly the scenario the OffloadingConnector is designed for.

Grafana data confirms: 5.9 running requests on average, max 7 (vs. 2.1 in the conservative configuration), with KV cache at 98.9% (vanilla) and 99.9% (offload).


6. Benchmark Results

6.1 Conservative Configuration

gpu-memory-utilization=0.75, max-model-len=8192, CUDA graphs enabled

Metric Vanilla Native Offload Delta
Requests completed 63 63 =
Running requests (avg) 2.1 1.9 -10%
Average TTFT (ms) 36,040 37,017 +2.7%
Output tok/s 107.1 104.9 -2.1%
Host RAM (GB) 5.51 13.59 +8.08 GB
KV cache % max 87.2% 81.9%  

Offloading is active (RAM rises by 8 GB) but produces no measurable benefit. The scheduler admits only 2 requests; no dynamic pressure on the KV cache, preemptions are rare.

6.2 Aggressive Configuration

gpu-memory-utilization=0.95, max-model-len=5120, –enforce-eager

Metric Vanilla Aggr. Offload Aggr. Delta
Requests completed 103 103 =
Running requests (avg) 5.9 5.5 -7%
Average TTFT (ms) 12,202 13,352 +9.4%
Median TTFT (ms) 16,325 17,857 +9.4%
Average request latency (s) 27.60 27.75 =
Output tok/s 175.4 173.5 -1%
Throughput req/s 0.34 0.34 =
Average ITL (ms) 30.1 28.2 -6.5%
ITL p99 (ms) 328.9 253.0 -23%
E2E latency p99 (s) 38.60 35.59 -7.8%
KV cache % max 98.9% 99.9%  
Host RAM (GB) 4.81 12.98 +8.17 GB
ITL p50 (ms) 17.7 38.2 +116%

Average throughput is identical, but tail latency improves: ITL p99 drops 23% (329 → 253 ms), E2E p99 drops 7.8%. Preempted requests recover faster via PCIe reload compared to full recompute. Average TTFT worsens by 9.4%, reflecting the continuous mirroring overhead of the connector.

6.3 Configuration Impact on the Scheduler

Metric Vanilla Conserv. Vanilla Aggr. Delta
Requests completed 63 103 +63%
Running requests 2.1 5.9 +181%
Average TTFT (ms) 36,040 12,202 -66%
Output tok/s 107.1 175.4 +64%
Request latency (s) 44.3 27.6 -38%

The aggressive configuration produces dramatic improvement across all metrics regardless of offloading. On consumer hardware with limited VRAM, vLLM parameter tuning has an orders-of-magnitude greater impact than enabling offloading.


7. Lessons Learned

The term “offloading” is misleading. It is used interchangeably for at least three architecturally different mechanisms: prefix caching, preemption recovery, and full KV swap. This ambiguity generates incorrect expectations, especially among infrastructure engineers who associate “offload” with memory swap/overflow.

Admission control governs everything. The vLLM scheduler is the primary gatekeeper. All offloading mechanisms operate downstream of admission. No currently available mechanism in vLLM bypasses admission control to admit more requests than the GPU can physically handle.

PCIe 4.0 is the limiting factor. At ~21 GB/s, the CPU→GPU transfer cost for reloading KV blocks is comparable to the recompute cost. The benefit manifests only on tail latency, not on average throughput. On hardware with faster interconnects (e.g. NVLink C2C at 900 GB/s on Grace Hopper/Blackwell systems), we would expect the ratio to shift substantially in favor of offloading, although this remains untested in our environment.

Parameter tuning surpasses offloading. The move from conservative to aggressive configuration produced +63% completed requests and -66% TTFT — without any offloading enabled. On consumer hardware, base parameter optimization has a far greater impact than experimental features.

The value of negative results. Demonstrating that a feature does not produce benefits in a specific scenario is a valid contribution. Few documents in the LLM inference community critically analyze the real limits of offloading on consumer hardware. Most benchmarks and guides assume datacenter hardware (H100, A100) with radically different bandwidth and memory characteristics.


8. Conclusions

LMCache operates as an external prefix cache, not as swap. Useful for workloads with shared prefixes (RAG, multi-turn) but invisible with random prompts.

The Native OffloadingConnector implements true preemption recovery offloading with asynchronous, layer-by-layer architecture. On RTX 4070 Super with PCIe 4.0, it reduces tail latency (ITL p99 by 23%, E2E p99 by 7.8%) but does not improve average throughput.

The vLLM scheduler is the dominant factor. With conservative configuration, no offloading mechanism produces effects. With aggressive configuration, the offloading benefit is real but marginal compared to the improvement from parameter tuning alone.

Recommendation for consumer hardware operators: before investing time in configuring offloading mechanisms, optimize base vLLM parameters (gpu-memory-utilization, max-model-len, enforce-eager). The benefit will be orders of magnitude greater. Native offloading becomes relevant only after maximizing scheduler concurrency, and its real value is destined to grow as hardware evolves toward faster CPU-GPU interconnects.

Future Work


Appendix A — Input/Output Length Choice

Input length is fixed at 4096 tokens; output is limited to ≤512 tokens (max_tokens). The model may produce fewer tokens due to EOS or other stop conditions.

Every token — input or output — contributes equally to KV cache size. In this benchmark, most of the KV cache derives from the input because prompt length (4096) far exceeds maximum generation length (512). This design ensures deterministic and immediate pressure on the KV cache during prefill, while keeping requests active long enough to observe decode phase dynamics.

This configuration (long input, moderate output) is standard in LLM serving benchmarks. It enables clear observation of KV cache saturation, scheduler behavior, and preemption events. Increasing output length would amplify these effects without modifying the fundamental system dynamics.

Appendix B — Unexplored Parameters and Known Limitations

kv_load_failure_policy. By default, failed KV block loads from CPU fall back to local recompute. Setting it to fail could isolate the pure connector effect by eliminating the recompute safety net. Not recommended in production but potentially useful for cleaner benchmark data.

Chunked Prefill tuning (max_num_batched_tokens). Controls how many tokens are processed per prefill step. Smaller values improve ITL but worsen TTFT, directly interacting with the OffloadingConnector’s work to reinsert preempted requests. Not varied in this benchmark.

CUDA Graphs with reduced max-model-len. An alternative to –enforce-eager: re-enable CUDA graphs while further reducing max-model-len to compensate. CUDA graphs improve decode throughput; this configuration could produce higher overall throughput while still triggering frequent preemptions. The trade-off between decode throughput and preemption frequency was not explored.

NIXL backend. vLLM supports the NixlConnector for highly efficient cross-process KV transfers. Relevant for disaggregated prefill/decode multi-instance configurations, outside the scope of this single-GPU study.

Hardware-specific applicability. The RTX 4070 Super with PCIe 4.0 sits at a specific bandwidth tier where transfer cost and recompute cost are approximately comparable. On hardware with lower bandwidth (PCIe 3.0), we would expect offloading to become counterproductive as transfer costs exceed recompute costs. On hardware with substantially higher bandwidth (NVLink C2C, Grace Hopper), we would expect offloading benefits to increase significantly. These are projections based on bandwidth ratios; they have not been experimentally validated.

Study Limitations

Results should not be automatically extended to NVLink platforms, Grace Hopper, Blackwell, or to multi-turn workloads with shared prefixes.

The practical conclusion for consumer hardware operators is clear: optimize the scheduler first, then evaluate offloading. Without the first step, the second risks remaining technically active but operationally marginal.