Dielabs — AI Infrastructure Engineering

About

The Lab

Dielabs investigates how architectures shape workload behavior in AI systems. The focus is on the layers that make inference a service (runtime, serving stack, and hardware and fabric) and on the system behavior they produce, from latency and throughput dynamics to capacity under load.

I design, benchmark, and validate inference architectures to map performance, trade-offs, and real-world limits — and turn these findings into opinionated frameworks for day 0–1 decisions: sizing, capacity planning, and workload deployment.

This work builds on a background in enterprise datacenter and presales engineering and is now fully focused on GPU-accelerated inference stacks and the operational discipline required to run them at scale.

What I work on

Focus Areas

GPU Infrastructure

Homelab node, DCGM, driver stack

Inference Runtimes

vLLM internals, KV cache, batching

Benchmarking

GuideLLM sweeps, crossover analysis

Observability

Prometheus, Grafana, PromQL, DCGM

Memory Engineering

KV offload, LMCache, quantization

Distributed Inference

TP/DP patterns, NUMA, CPU-GPU isomorphism

Methodology

Five Frameworks for the Inference Lifecycle

From a business need to a system in production: a framework set that turns LLM inference deployment into a sequence of named, defensible decisions — each anchored in lab evidence, not vendor slides.

01

Anatomy

The LLM Inference Stack Model

Layered model of an LLM inference system, from physical hardware (L0) to the client (L6). Each layer enables the one above.

02

Sizing

From Idea to Production

Inference sizing in 11 steps, from business need to validated deployment. Separates customer inputs from the architectural response. The conversation framework.

03

Competence