When Good GPUs Produce Bad Tokens

Narrative Loops and Degenerative Decoding in vLLM

Qwen2.5-7B-Instruct-AWQ · Mistral-7B-Instruct-v0.3-FP8 · RTX 4070 Super 12GB · vLLM


Part 1 — Qwen2.5-7B: Phenomenon, Diagnosis and Framework

1. Infrastructure Context

The case was observed on a homelab node with the following characteristics:

The system was stable throughout the entire test duration. No memory saturation, no KV cache spikes, no CUDA errors, no OOM.

2. Observed Phenomenon

During a single long generation, the text began repeating itself. After a certain point the narrative entered a loop: phrases and structures replicated cyclically, with no technical error detected.

The phenomenon is NOT related to:

It is a degenerative decoding phenomenon — a statistical behavior of the model, not a hardware issue.

3. Technical Mechanism

3.1 Probabilistic Attractor

An LLM generates tokens by maximizing the conditional probability P(token_t token_1..t-1). When the text enters a repetitive pattern and there is no repetition penalty, the model can find a very high-probability sequence and remain trapped in it.

The autoregressive cycle closes as follows:

With repetition_penalty = 1.0 (default) there is no mechanism to penalize tokens already seen in the context. The loop becomes stable and persistent.

3.2 Hypothesis on Long-Sequence Degradation

Observed data: degradation emerges in the final portion of long generations, not at the beginning.

Hypothesis — the cause cannot be determined from the tests conducted. Plausible explanations, not mutually exclusive:

Determining which factor is dominant would require controlled tests with isolated variables — for example: same prompt without rigid structure, token-by-token probability distribution analysis, context length ablation.

3.3 Linguistic Drift as a Side Effect

Observed data: with repetition_penalty > 1.0 and no explicit language constraint, the text drifted toward English beyond ~5000 tokens. Adding the constraint in the system prompt eliminated the drift.

Hypothesis on the cause: when plausible alternatives in the current language are exhausted under the penalty effect, the model samples from alternative distributions. Plausible causes for why English emerges as an alternative include the pretraining corpus distribution, tokenizer efficiency for Italian, instruction tuning balance, and effective sampling temperature in that context region. It is not possible to isolate a single factor from the tests conducted.

Practical implication — regardless of the cause: repetition_penalty is not a neutral parameter. On multilingual models it acts on the overall linguistic distribution, not just on lexical repetition.

4. Test Results

4.1 Comparative Test: penalty 1.0 vs 1.2 (target 75 paragraphs)

Fixed parameters: default temperature (1.0), default top_p (0.9), no system prompt.

Parameter Test 1 — penalty 1.0 Test 3 — penalty 1.2
Loop onset Paragraph 16 No loop
Completion No (infinite loop) Yes — Paragraph 75 reached
Linguistic coherence IT stable (until loop) IT → EN switch from ~§66
Side effects Infinite loop, identical token by token Linguistic drift at the tail
Modified parameters None (all defaults) Only repetition_penalty

4.2 Stress Test: 150 paragraphs (penalty 1.2 + IT system prompt)

Added system prompt: “Rispondi esclusivamente in italiano. Non usare altre lingue.” The linguistic constraint eliminated the IT→EN drift, but tail degradation did not disappear — it manifested differently.

Zone Paragraphs Observed phenomenon
Phase 1 — Stable 1–63 Coherent IT, progressive narrative
Phase 2 — Language drift 64–83 Progressive IT→EN switch
Phase 3 — Collapse 84–86 Intra-paragraph loop, token flooding, complete degeneration

Critical finding: linguistic degradation begins when the generated context reaches approximately 5000–6000 tokens, regardless of the declared target in the prompt and well below the technical context window limit (32768 tokens). There exists therefore a practical coherence threshold for this model on this type of task, separate and independent from the technical context window.

5. Interpretive Schema: Three Distinct Planes

The analysis revealed that output quality and infrastructure state are two completely separate planes. A system can be perfectly stable — KV cache at 20%, GPU calm, scheduler regular — and still produce degenerative output. The most common diagnostic error is looking for the cause on the wrong plane.

The three-plane framework serves to quickly identify the correct level where the cause lies, before intervening.

5.1 System Plane

Concerns everything measurable at the hardware and orchestration level. It is the most visible plane: Grafana covers it entirely and problems manifest with clear signals — OOM, latency spikes, saturated KV cache, GPU at 100%.

Diagnostic question: is there pressure on physical resources or orchestration?

Metrics: GPU utilization, VRAM used, KV cache %, latency per token, throughput, scheduler queue depth, CUDA errors.

Typical solutions: reduce concurrency, increase VRAM, reduce max_model_len, optimize gpu_memory_utilization, update CUDA drivers.

In the documented case: KV cache at 20%, GPU stable, no errors. The System Plane was clean. The cause was elsewhere.

5.2 Model / Decoding Plane

Concerns the statistical behavior of the model during generation. Not visible on Grafana: requires observing the output directly and knowing the active sampling parameters. Problems here do not produce technical errors — they produce low-quality output: loops, incoherence, linguistic drift, hallucination.

Diagnostic question: is the anomalous behavior reproducible with a stable system? Are the sampling parameters configured correctly for the task?

Metrics: repetition rate in the output, narrative coherence, lexical diversity, entropy of generated tokens, output language.

Typical solutions: calibrate repetition_penalty, temperature, top_p; add stop sequences; set max_tokens; reinforce the system prompt with linguistic constraints and format instructions.

In the documented case: repetition_penalty = 1.0 (default) allowed the loop at paragraph 16. Raising it to 1.2 eliminated the loop. The problem was entirely on this plane.

5.3 Architecture / Context Plane

The subtlest plane: concerns model behavior as a function of generated context length. Problems here produce no technical errors and are not detectable from sampling parameters — they emerge only with long outputs, on a stable system with correct parameters.

Diagnostic question: does degradation emerge only beyond a certain output length, even with a stable system and correct parameters?

Metrics: number of tokens generated at the moment of degradation, position in the context window, type of degradation (syntactic compression, linguistic drift, intra-paragraph collapse).

Typical solutions: set a max_tokens hard limit below the empirically measured degradation threshold; break the task into multiple shorter requests; use models with superior training length for tasks requiring very long outputs.

In the documented case: with 150 paragraphs and correct parameters, degradation emerged around 5000–6000 generated tokens — well below the technical limit of 32768. The Architecture/Context Plane defined the practical task limit.

5.4 Summary Table of the Three Planes

System Plane Model/Decoding Plane Architecture/Context Plane
GPU utilization Repetition rate Position in context
KV cache % Token entropy Long-sequence behavior
vLLM scheduler Narrative divergence Type of observed degradation
Latency / throughput Sampling parameters Effective generated length

5.5 The Causal Chain: Planes Mask Each Other

The three planes are not independent problems: they can be overlapping and sequential. A problem on a higher plane can completely mask a problem on a lower plane, preventing its observation.

Phase 1 — Model Plane problem visible, Architecture Plane hidden. With repetition_penalty = 1.0, the loop emerged at paragraph 16 — approximately 800–1000 generated tokens. The test never reached enough length to stress the Architecture Plane. The third layer already existed, but was masked by the second.

Phase 2 — Model Plane resolved, Architecture Plane emerges. Raising the penalty to 1.2, the loop disappeared. The model generated freely up to 75, then 150 paragraphs. Only at that point did the Architecture Plane limit become observable: degradation at ~5000 tokens, independent of decoding parameters and infrastructure state.

Practical implication: after resolving a problem on any plane, do not declare the system resolved without explicitly verifying the underlying planes. Diagnosis must be sequential and complete.

5.6 Hypothesis: Sampling Entropy Collapse

A useful interpretive framework to explain the observed phenomena is Sampling Entropy Collapse: the tendency of the model’s probability distribution to progressively narrow as the generated sequence lengthens.

The hypothesized mechanism:

long sequence
↓
accumulated context constrains the distribution
↓
the distribution becomes increasingly narrow
↓
few tokens dominate with high probability
↓
loop or degeneration

This hypothesis is consistent with three test observations:

Why the penalty resolves the loop. The repetition_penalty artificially lowers the probability of previously seen tokens, reopening the sampling space. It does not correct the cause of the collapse — it circumnavigates it by forcing diversity.

Why the problem emerges late. Entropy does not collapse immediately: a sufficiently long sequence is needed for the distribution to narrow enough to trigger the loop. In the tests the threshold was ~800 tokens (without penalty) and ~5000 tokens (with penalty 1.2).

Why it depends on the task. A narrative task with rigid structure and progressive numbering constrains the space of valid sequences more than a Q&A or technical task. Fewer degrees of freedom = narrower distribution faster = earlier collapse.

This hypothesis has not been verified by directly measuring entropy during generation, and is not the only possible explanation. It remains a useful interpretive framework for practical diagnosis and for designing future tests.

6. Implications for Production Inference Engineering

6.1 Decoding Parameters

6.2 Quality Control Strategy for Production

Recommended priority order:

  1. First defense: max_tokens hard limit calibrated on the task + explicit stop sequences
  2. Second defense: repetition_penalty = 1.2 with linguistic constraint in the system prompt
  3. Monitoring: quality metrics separate from infrastructure metrics

6.3 Observed Practical Limit

For Qwen2.5-7B-Instruct-AWQ on long narrative task in Italian:

7. The Coherence Threshold is a Property of the Task, Not the Model

The ~5000–6000 token threshold observed is not a fixed architectural limit of the model. It is an emergent property of the specific model–task–language combination tested: Italian prose narrative with rigid numbered structure.

7.1 Factors Influencing the Threshold

Text type: prose narrative requires causal and emotional continuity. Structured technical texts (documentation, Q&A, commented code) have more semantically independent sections — context pressure is lower and the threshold may rise.

Language: the tests showed that language influences behavior — the IT→EN drift is an observed fact. Plausible causes include pretraining corpus distribution, tokenizer efficiency for Italian, and instruction tuning balance. It is not possible to isolate a single factor from the tests conducted.

Formal prompt constraints: the prompt used in tests imposed rigid structure (progressive numbering, minimum paragraph length). Multiple formal constraints narrow the space of valid sequences, increasing the probability that the model finds a dominant attractor sooner. More open prompts tend to postpone the collapse.

7.2 Methodological Implication

There is no universal threshold valid for all tasks. In a production context, the coherence threshold must be measured empirically for each relevant combination of model, task and language. The correct procedure: define the target task and language, conduct stress tests with progressively longer outputs, empirically identify the degradation threshold, set a max_tokens hard limit below that threshold with an adequate safety margin.


Part 2 — Comparison with Mistral-7B-FP8: Deployment and Benchmark

8. Context and Motivation

In Case Study 1 a practical coherence threshold of ~5000–6000 generated tokens was documented on Qwen2.5-7B, independent of the technical context window and infrastructure state.

The next logical step is to verify whether that threshold is a specific characteristic of Qwen2.5-7B or a general behavior of models at that parameter count on that type of task. To do this, it is necessary to run the same tests on a different model, with identical hardware and decoding parameters.

Objective: determine whether a model with different characteristics produces a significantly different coherence threshold.

9. Model Selection

9.1 Selection Criteria

For a valid comparison the model must satisfy three criteria: same parameter size (7B), compatibility with available hardware (12GB VRAM), and characteristics sufficiently different from Qwen2.5 to make the comparison informative.

The selected model is neuralmagic/Mistral-7B-Instruct-v0.3-FP8, produced by NeuralMagic and specifically optimized for vLLM.

9.2 Technical Comparison

Parameter Qwen2.5-7B-Instruct-AWQ Mistral-7B-Instruct-v0.3-FP8
Quantization AWQ 4-bit FP8 8-bit
Size on disk ~4 GB ~7 GB
VRAM occupied ~4 GB ~7 GB
Residual VRAM for KV cache ~6 GB ~2.4 GB
Supported max_model_len 32768 tokens 19000 tokens (VRAM-constrained)
Quality vs full precision Greater degradation 65.85 vs 66.33 benchmark — minimal
CUDA kernel AWQ kernel CutlassFP8ScaledMMLinearKernel
Architecture family Qwen2ForCausalLM MistralForCausalLM

The main trade-off is quality vs VRAM footprint. FP8 better preserves the quality of the original model (minimal benchmark loss: 0.48 points) but occupies nearly twice the VRAM compared to AWQ 4-bit, reducing the serviceable context window on the same GPU.

9.3 Why FP8 is the Middle Ground

On 12GB VRAM the choice was between three options: AWQ 4-bit (~4GB), full precision BF16 (~14GB, not compatible), and FP8 (~7GB, compatible with margin). FP8 is the format recommended by NeuralMagic for production deployment with vLLM.

10. Troubleshooting: From First Error to Successful Startup

10.1 Sequence of Events

Step Event Corrective action
1 Initial configuration with max_model_len = 32768
2 Model download: 416 seconds (~7GB from HuggingFace) Expected — first startup downloads from HF Hub
3 ERROR: ValueError — Insufficient KV cache. Required 4.0 GiB, available 2.41 GiB for max_model_len 32768 Reduce max_model_len. vLLM suggests estimated maximum: 19696 tokens
4 Updated docker-compose: max_model_len = 19000 (safety margin) Container restart with new config
5 Second startup: model already cached, loading in 16 seconds Successful startup — Application startup complete

10.2 Error Analysis

The error that emerged at first startup demonstrates how the quantization format directly impacts VRAM available for the KV cache — and therefore the serviceable context window.

The causal chain: FP8 occupies 7GB of VRAM → 2.41GB remain free → serving max_model_len 32768 requires 4.0GB of KV cache → 2.41 < 4.0 → error.

vLLM autonomously provided the estimate of the maximum supportable value (19696 tokens), making the correction immediate.

Note: this is a concrete example of interaction between the System Plane and the Architecture Plane. The choice of quantization format (System Plane) determines the VRAM available for the KV cache, which in turn constrains the maximum context window (Architecture Plane). The planes are never completely independent.

11. Detailed Startup Log Analysis

vLLM startup logs contain dense diagnostic information. Below each relevant entry with its technical meaning.

Log entry Technical meaning
version 0.15.1 vLLM V1 engine active — rewritten internal architecture, more efficient for scheduling and memory than previous versions.
Resolved architecture: MistralForCausalLM vLLM identified the model family. Determines which CUDA kernel is selected and how attention operations are optimized.
quantization=fp8 / Selected CutlassFP8ScaledMMLinearKernel FP8 requires specific kernels for matrix multiplications. CutlassFP8 is the NVIDIA kernel optimized for Ada Lovelace architectures (RTX 4070 Super).
Using FLASH_ATTN attention backend FlashAttention selected automatically among 4 alternatives. Reduces memory usage during attention and increases speed — optimal choice for RTX 4070 Super.
Model loading took 7.01 GiB and 15.976617 seconds 15 seconds because the model was already in local cache. At first startup with download it had taken 429 seconds.
torch.compile takes 9.61 s in total vLLM compiles the computational graph with PyTorch Inductor. Result cached — subsequent executions use the cache.
Available KV cache memory: 2.41 GiB / GPU KV cache size: 19,696 tokens With 2.41 GiB remaining, vLLM can manage at most 19,696 total KV cache tokens. This is the physical constraint that required reducing max_model_len.
Maximum concurrency: 1.04x for 19,000 tokens With 19k token requests, the KV cache supports slightly more than one concurrent request.
Capturing CUDA graphs: 51/51 + 35/35 vLLM captures 86 CUDA graphs for batch sizes from 1 to 512 tokens. They eliminate CUDA scheduling overhead — improve decode latency. Cost: 0.47 GiB of additional VRAM.
Chunked prefill enabled, max_num_batched_tokens=2048 Prefill is split into 2048-token chunks. Prevents long prompts from monopolizing the GPU and blocking ongoing decode requests.
enable_prefix_caching=True KV cache reuse for identical prefixes across requests. With a fixed system prompt in tests, each new chat benefits from this.
Application startup complete. Engine ready to receive requests. This is the target line to look for in logs to confirm successful startup.

12. Docker-compose Configuration

12.1 Explicitly Declared Parameters

Parameter Value Meaning
–model neuralmagic/Mistral-7B-Instruct-v0.3-FP8 HuggingFace model identifier.
–max-model-len 19000 Maximum serviceable context window — reduced due to VRAM constraint.
–gpu-memory-utilization 0.85 85% of 12GB = 10.2GB allocated to vLLM.
–host / –port 0.0.0.0 / 8000 Network binding to reach the service from the host.
version: "3.9"
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - vllm_models:/root/.cache/huggingface
    command:
      - "--model"
      - "neuralmagic/Mistral-7B-Instruct-v0.3-FP8"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--max-model-len"
      - "19000"
      - "--gpu-memory-utilization"
      - "0.85"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host

12.2 Features Activated Automatically by vLLM

Everything not in the docker-compose is configured automatically by vLLM at runtime, based on the detected model and hardware.

Feature How and why it is activated automatically
FlashAttention backend vLLM detects the GPU and selects the most efficient attention backend. On RTX 4070 Super (Ada Lovelace), FlashAttention is always the optimal choice.
CutlassFP8 kernel Detected FP8 quantization, vLLM automatically selects the optimized CUDA kernel.
Chunked prefill Active by default in vLLM 0.15.x with V1 engine.
Prefix caching Active by default. Particularly useful with fixed system prompts.
torch.compile + Inductor V1 engine compiles the computational graph at first startup and caches the result.
CUDA graphs (86 graphs) Captured automatically for batch sizes from 1 to 512. Reduce decode latency.
Asynchronous scheduling Active by default in V1 engine. Decouples scheduling from GPU execution.

Key principle: the docker-compose is the minimum contract with vLLM. Everything else is automatic optimization. Adding unnecessary parameters introduces conflict risk and reduces readability without concrete benefits.

13. Comparative Test Results

Tests were run with the same parameters as Part 1: repetition_penalty = 1.2, numbered narrative prompt of 75 paragraphs, Italian and French languages.

13.1 Test A — 75-Paragraph Narrative in Italian

Collapse occurred at paragraphs 3–4, approximately 200–300 generated tokens. Paragraph 1 was comprehensible Italian but with syntactic anomalies at the tail. From Paragraph 3: fragmented Italian, invented neologisms, token flooding, insertion of Chinese characters. By Paragraph 4 the text was completely illegible.

Direct comparison: Qwen2.5-7B with the same penalty had maintained coherence up to paragraph 63 (~5000 tokens). The ratio is approximately 25:1 in favor of Qwen on this task.

13.2 Test B — 75-Paragraph Narrative in French

The hypothesis was that Mistral, developed by a French company, would perform significantly better in French. Result: collapse at §4 instead of §3 — one additional paragraph. The difference is minimal and insufficient to draw conclusions about language as a causal factor.

13.3 Test C — Short Technical Task in Italian (~500 words)

Markedly positive result: fluent Italian, coherent logical structure, correct coverage of key concepts. Some minor terminological inaccuracies but acceptable for a 7B on a popularization task. Zero degeneration, zero linguistic drift.

This demonstrates that the collapse observed in narrative tests is not a general limitation of Mistral — it is specific to the combination of model + long output + rigid structure.

13.4 Summary Table

Model Task Penalty Collapse Outcome
Qwen2.5-7B-AWQ IT narrative 75 § 1.2 §63 (~5000 tok) Complete, degrades at tail
Mistral-7B-FP8 IT narrative 75 § 1.2 §3–4 (~200 tok) Immediate collapse
Mistral-7B-FP8 FR narrative 75 § 1.2 §4 (~300 tok) Immediate collapse
Mistral-7B-FP8 IT technical task ~500 words 1.2 None (~400 tok) Coherent, acceptable quality

14. Comparison Conclusions

The threshold is model-specific — observed data. On this task, Qwen2.5-7B maintained coherence for ~5000 tokens. Mistral-7B collapsed within 200–300 tokens. 25:1 ratio. Same parameter count, same hardware, same decoding parameters — completely different results.

The causes of this difference cannot be isolated from the tests conducted. Candidate factors include: dataset and instruction tuning, sampling sensitivity, quantization differences (AWQ vs FP8), tokenizer. The test demonstrates the difference, not the cause.

Language as a factor — insufficient data. French postponed the collapse by one paragraph compared to Italian. The difference is minimal and does not allow concluding that language is a relevant causal factor in this comparison.

The threshold is task-specific — observed data. Mistral on a short technical task (~400 tokens) produces coherent and acceptable quality output. The collapse observed in narrative tests is not a general limitation of the model — it is specific to the combination of model + long output + rigid structure.

Implication for deployment: model selection must explicitly consider the type of task and expected output length. An English benchmark on generic tasks is not sufficient to predict behavior on specific tasks in different languages. The coherence threshold must be measured empirically for each relevant model-task-language combination.


Document produced as part of the AI Infrastructure lab — Dielabs Academy