GPU execution platform

INFERENCE FABRIC

The execution layer for autonomous work.

The fastest kernels on Earth are worthless behind a starved scheduler.

The GPU problem is solved. The scheduling problem is not. Inference Fabric saturates NVIDIA silicon by fixing the supply chain around the kernels, not the kernels inside it.

Why we built it

AgentOS created a new workload class.

Inference Fabric was built because autonomous work produces a workload pattern that conversational inference stacks were never designed to handle: continuous, ragged, fleet-scale, and disconnected from any user waiting on a response.

The problem

Modern inference stacks optimize inference. They don't optimize workload supply.

The kernels are already world-class. The bottleneck moved. Real workloads arrive ragged, batches run half-empty, and the tensor cores you paid for sit idle waiting on the scheduler.

Symptom 01

Ragged workloads

Sequences of every length arrive out of order. The scheduler batches only what showed up in the window.

Symptom 02

Padding waste

Mismatched lengths pad up to the longest in the batch. Industry pipelines burn 15 to 40 percent of GPU time on zeros.
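
To make the padding tax concrete, here is a minimal sketch of the arithmetic, assuming every sequence in a batch is padded to the longest one. The function name and the sample lengths are illustrative, not taken from the Fabric.

```python
def padding_waste(lengths):
    """Fraction of a padded batch's token slots that hold padding, not work."""
    if not lengths:
        return 0.0
    padded_slots = max(lengths) * len(lengths)  # every row padded to the longest
    useful_tokens = sum(lengths)
    return 1.0 - useful_tokens / padded_slots


if __name__ == "__main__":
    ragged = [128, 512, 900, 1024]     # hypothetical mixed-length batch
    uniform = [1000, 1010, 1020, 1024]  # hypothetical near-uniform batch
    print(f"ragged:  {padding_waste(ragged):.1%}")
    print(f"uniform: {padding_waste(uniform):.1%}")
```

The same four slots cost very different amounts of GPU time depending on how alike the sequences are, which is why batch composition matters as much as batch size.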

Symptom 03

Scheduler starvation

If admission averages 40 sequences on hardware that holds 512, the scheduler dutifully saturates roughly 8 percent of the machine.

Symptom 04

Idle tensor cores

The most expensive silicon in the building waits on a queue that doesn't understand token geometry. SM utilization sits at 45 to 60 percent.

Four tools, four taxes. The serving engine batches only what arrived. The queue moves bytes and destroys token geometry. The orchestrator can observe a dollar but can't enforce one. The vector store serializes between storage and the kernel. Each tax compounds the others, and the point fixes do not compose.

The thesis

Same NVIDIA kernels. Different supply chain.

TensorRT-LLM · Inference Fabric · Saturated GPUs

We don't replace NVIDIA inference. We feed it.

What it is

A saturation harness. Not an inference engine.

▍ The kernels are NVIDIA's. The supply chain is ours.

What it is

A Rust + C++ saturation harness around NVIDIA's stock TensorRT-LLM kernels (FP8 weights, KV, and attention)
Token-budget feeding: admission, hysteresis, and pacing measured in tokens, not requests
Eight-level backpressure, bounded end to end
Zero-allocation hot path: pre-faulted pools, verified by Nsight
Sync discipline: compute-stream sync physically rejected
Fail-closed loader: nine required engine fields, no defaults

What it is not

No custom kernels. The kernels are a solved problem. NVIDIA's are excellent, and we delegate to them.
No custom graph capture. Marginal gain under 1 percent; delegated to TensorRT-LLM.
No tokenizer. Handled upstream by the producers.
No multi-model or multi-GPU in one process. One container, one GPU, one engine.
No from-scratch attention. NVIDIA's FP8 FMHA is the state of the art.

The contribution is what surrounds the kernels, not what replaces them. Anything NVIDIA does well, the Fabric delegates to NVIDIA and refuses to compete on. Anything NVIDIA doesn't do, the Fabric builds around the kernels. Discipline of scope is what earns the 22x ladder.

How it saturates

Three properties keep the tensor cores hot.

None of this is a kernel change. All of it is the engine artifact, the queues around it, and the threading model.

P1 · TOKEN-BUDGET FEEDING

Fed by tokens, not requests.

Admission measures the in-flight token budget; new sequences enter only when their padded-token count fits. High/low-water hysteresis prevents completion-rate collapse. The working set tracks GPU bandwidth, not a fixed batch size.
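
One way to picture token-budget admission with hysteresis is the sketch below. It is an illustration under stated assumptions, not the Fabric's controller: the class name, the two water marks, and the FIFO wait queue are all hypothetical.

```python
from collections import deque


class TokenBudgetAdmission:
    """Admit work by padded-token count, pausing between high and low water."""

    def __init__(self, high_water, low_water):
        if not 0 <= low_water < high_water:
            raise ValueError("need 0 <= low_water < high_water")
        self.high_water = high_water  # ceiling on in-flight padded tokens
        self.low_water = low_water    # admission resumes once drained to here
        self.in_flight = 0
        self.admitting = True
        self.waiting = deque()

    def submit(self, seq_id, padded_tokens):
        if padded_tokens > self.high_water:
            raise ValueError("sequence can never fit the budget")
        self.waiting.append((seq_id, padded_tokens))

    def admit(self):
        """Return ids of waiting sequences that fit the budget right now."""
        if not self.admitting and self.in_flight <= self.low_water:
            self.admitting = True  # drained far enough: reopen admission
        admitted = []
        while self.admitting and self.waiting:
            seq_id, tokens = self.waiting[0]
            if self.in_flight + tokens > self.high_water:
                self.admitting = False  # hit the ceiling: pause until low water
                break
            self.waiting.popleft()
            self.in_flight += tokens
            admitted.append(seq_id)
        return admitted

    def complete(self, padded_tokens):
        """Release the budget held by a finished sequence."""
        self.in_flight -= padded_tokens
```

The gap between the two marks is the hysteresis: admission does not flap open and shut on every completion, it reopens only after a meaningful amount of budget has drained.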

P2 · EIGHT-LEVEL BACKPRESSURE

Bounded everywhere. Blocking is the signal.

From broker socket to admission queue to the executor's pending list to the KV manager to the completion drain, every queue is bounded. A slow stage blocks the one above it. No oscillation, no tail-latency cliff, no mystery memory growth.
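
The principle that blocking is the signal can be sketched with ordinary bounded queues. This is a toy three-queue pipeline, assuming one thread per stage; the queue names echo the stages above, but the depth and the per-stage work are placeholders.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker passed down the pipeline


def stage(inbox, outbox, work):
    """Pull from a bounded inbox, push to a bounded outbox.

    outbox.put() blocks while the outbox is full, so a slow downstream
    stage stalls this one, which in turn stalls the stage above it.
    """
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(work(item))


def run_pipeline(items, depth=2):
    admission = queue.Queue(maxsize=depth)
    pending = queue.Queue(maxsize=depth)
    drain = queue.Queue(maxsize=depth)
    results = []

    def collect():
        while True:
            item = drain.get()
            if item is SENTINEL:
                return
            results.append(item)

    threads = [
        threading.Thread(target=stage, args=(admission, pending, lambda x: x)),
        threading.Thread(target=stage, args=(pending, drain, lambda x: x * 2)),
        threading.Thread(target=collect),
    ]
    for t in threads:
        t.start()
    for item in items:
        admission.put(item)  # the producer itself blocks when the chain is full
    admission.put(SENTINEL)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(10)))
```

With every queue capped at `depth`, total buffered work is bounded no matter how fast the producer runs, which is the property that rules out unexplained memory growth.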

P3 · BLOCKING HOT PATH

The way the kernels were designed to be called.

Async drives admission and ingress at the edges; the inference call runs on a blocking thread pinned to the executor. There is no async on the GPU's path: the GPU sees a single, well-fed producer thread.
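
The split between async edges and a blocking hot path might look like the sketch below. It assumes a single dedicated thread that owns the inference call; `run_batch` is a hypothetical stand-in for that call, and CPU pinning of the thread is not shown.

```python
import asyncio
import queue
import threading


def executor_loop(work_q, loop, run_batch):
    """Dedicated blocking thread: the only caller of run_batch."""
    while True:
        item = work_q.get()  # plain blocking get; no event loop on this path
        if item is None:     # shutdown marker
            return
        batch, future = item
        result = run_batch(batch)  # stand-in for the blocking inference call
        # Hand the result back to the async side without touching it here.
        loop.call_soon_threadsafe(future.set_result, result)


async def submit(work_q, loop, batch):
    """Async edge: enqueue a batch and await its result."""
    future = loop.create_future()
    # Run the (possibly blocking) put off the event loop thread.
    await loop.run_in_executor(None, work_q.put, (batch, future))
    return await future


async def main():
    loop = asyncio.get_running_loop()
    work_q = queue.Queue(maxsize=4)  # bounded handoff between the two worlds
    worker = threading.Thread(
        target=executor_loop, args=(work_q, loop, lambda b: sum(b))
    )
    worker.start()
    results = await asyncio.gather(
        *(submit(work_q, loop, [i, i + 1]) for i in range(3))
    )
    await loop.run_in_executor(None, work_q.put, None)
    await loop.run_in_executor(None, worker.join)
    return results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The event loop handles many concurrent submitters, but only one thread ever calls `run_batch`, so the simulated device sees one producer.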

The supply chain, in motion

Ragged work in. Saturated GPUs out.

Producers submit work of every length and walk away. FEED groups it by shape into pad-minimal lanes, BURN keeps the tensor cores fed end to end, and completed results land when they are ready.

01 · SUBMIT & EXIT

Producers fire and forget

Apps submit work of every length and disconnect. No held connections, nothing waiting.

02 · PACK BY SHAPE

Uniform lanes, ~zero padding

Work is clustered by shape so the GPU only ever sees clean, homogeneous batches.
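
As a rough illustration of packing by shape, the sketch below buckets sequence lengths into fixed-width lanes and cuts batches within each lane. The lane width, batch cap, and sample lengths are assumptions chosen for the example, not the Fabric's clustering rule.

```python
from collections import defaultdict


def pack_by_shape(lengths, lane_width=64, max_batch=8):
    """Group sequence lengths into lanes of similar length, then cut batches."""
    lanes = defaultdict(list)
    for n in lengths:
        # Lane k holds lengths in (k * lane_width, (k + 1) * lane_width].
        lanes[(n - 1) // lane_width].append(n)
    batches = []
    for key in sorted(lanes):
        lane = sorted(lanes[key])
        for i in range(0, len(lane), max_batch):
            batches.append(lane[i:i + max_batch])
    return batches


def waste(batches):
    """Padding fraction when each batch is padded to its own longest row."""
    padded = sum(max(b) * len(b) for b in batches)
    useful = sum(sum(b) for b in batches)
    return 1.0 - useful / padded


if __name__ == "__main__":
    lengths = [30, 500, 35, 510, 1000, 28, 990, 505]  # hypothetical arrivals
    print(f"one mixed batch: {waste([lengths]):.1%}")
    print(f"packed by shape: {waste(pack_by_shape(lengths)):.1%}")
```

Narrower lanes cut padding further but leave fewer sequences per lane, so a real packer has to trade padding against how long a lane waits to fill.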

03 · SATURATE

Tensor cores stay hot

No idle time between batches: ~97% mean SM utilization and a 250K tokens/sec peak on a single card.

04 · RETURN

Results land when ready

Completed work is drained back, with no work lost and no work duplicated.

Evidence

Two consumer cards, approaching data-center throughput.

Same silicon as everyone else. Different scheduler. These numbers come from live production, not from a benchmark harness.

250K

peak tokens / sec, single RTX 5090

160K+

sustained tokens / sec, single GPU

96.9%

mean SM utilization

99.9%

KV-cache hit rate

0.42%

padding waste (vs 15-40% industry)

▍ from live production telemetry

Stock TensorRT-LLM / vLLM

+ in-flight batching · 1.3x

+ FP8 tensor cores

+ shape-pure batching

+ class-keyed KV reuse

+ saturated runtime (no idle GPU) · 14x

+ max-TPS mode · 22x

Every rung is an attributable engineering decision, measured on the same RTX 5090. 22x the throughput per GPU is roughly a 95 percent reduction in marginal cost per million tokens at saturation.
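
The step from a throughput multiple to a cost reduction is simple arithmetic, sketched here under one assumption: the cost of running the GPU is fixed per hour, so cost per token scales inversely with tokens per second.

```python
def marginal_cost_reduction(speedup):
    """Fractional drop in cost per token when one GPU does `speedup`x the work."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for label, multiple in [("peak, 22x", 22), ("sustained, 14x", 14)]:
        print(f"{label}: {marginal_cost_reduction(multiple):.1%} lower cost per token")
```

At 22x the reduction is 1 - 1/22, which is where the figure of roughly 95 percent comes from; at the 14x sustained rung the same formula gives roughly 93 percent.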

Where it stands

Not a competitor. A force multiplier.

vs vLLM and stock TensorRT-LLM

Same card. 22x the work.

~2x useful work → 22x peak · 14x sustained

vLLM is excellent for evaluation and small-team workloads, and remains the right tool there. For production inference at scale, the question is whether you can afford to leave roughly half the silicon idle. If your bill includes the GPU, the answer is no.

vs managed APIs

Your weights. Your hardware.

~95% lower marginal cost per million tokens at saturation

A managed API is the right answer until the bill grows linearly with your traffic. For inference workloads that are now the product, the math has flipped: your weights, your GPUs, no per-token markup, no egress, no rate limits.

Six guarantees

Engineered properties, not aspirations.

Each one removes a category of incident from your runbook.

G1 · SATURATION BY CONSTRUCTION

96.9% mean SM utilization, live.

The GPU is never waiting on the scheduler. The scheduler is always waiting on the GPU.

G2 · NO DRIFT

The artifact answers for itself.

Nine required engine-config fields, no defaults. Staging-vs-production drift is engineered out, not patched.
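
A fail-closed loader can be sketched in a few lines. The page does not list the nine required fields, so the field names below are hypothetical placeholders and the list is deliberately shorter; only the behavior (refuse to start, never substitute a default) reflects the text above.

```python
class EngineConfigError(ValueError):
    """Raised when an engine config is incomplete; the loader never guesses."""


def load_engine_config(raw, required_fields):
    """Return the validated config, or raise. No field ever gets a default."""
    missing = [f for f in required_fields if raw.get(f) is None]
    if missing:
        raise EngineConfigError(f"missing required engine fields: {missing}")
    return {f: raw[f] for f in required_fields}


# Hypothetical field names, for illustration only.
REQUIRED = ("max_batch_size", "max_num_tokens", "kv_cache_dtype")

if __name__ == "__main__":
    ok = {"max_batch_size": 512, "max_num_tokens": 8192, "kv_cache_dtype": "fp8"}
    print(load_engine_config(ok, REQUIRED))
    try:
        load_engine_config({"max_batch_size": 512}, REQUIRED)
    except EngineConfigError as err:
        print(f"refused to start: {err}")
```

Because a missing value stops the process instead of falling back silently, two environments cannot run with different effective settings without one of them failing loudly.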

G3 · NO ALLOCATION ON HOT PATH

Pre-faulted pools, verified by Nsight.

Resident set is flat. The P99 tail is the kernel, never the allocator.

G4 · NO COMPUTE-STREAM SYNC

Sync is rejected at the stream.

Library-injected sync collapses the pipeline. The kernels can only be reached through the executor's API.

G5 · FAIL-FAST EXIT

Under 5-second replacement.

Any fatal error exits in under 5 seconds with ordered teardown. The provisioner replaces; the broker rewinds cursors.

G6 · ONE GPU, ONE ENGINE

No tenant sharing.

One container, one visible GPU, one engine. The ceiling is the GPU's hardware ceiling, not a software arbitration ceiling.

CUDA 13.1 · Blackwell SM 120

Built for the newest NVIDIA silicon. The runtime keeps NVIDIA's FP8 tensor cores saturated on Blackwell at the full multiplier; Hopper, Ada, and Ampere run the same codebase at roughly half. Separately from the inference path, which uses only NVIDIA's stock kernels, hand-written CUDA kernels compiled to native SM 120 run GPU analysis ahead of the off-the-shelf ecosystem.

Runs on any NVIDIA card · RTX 5090 / 5080 → L40S · H100 · H200 · B100 / B200 · one card or one hundred

Why this drives GPU demand

Conversational systems wait for users. Autonomous work doesn't.

AgentOS generates continuous autonomous work. Inference Fabric converts that work into GPU-efficient execution.
The more autonomous work exists, the more GPU demand exists.

The work layer

AgentOS

The autonomous work operating system. It generates sustained, continuous workloads instead of waiting on a human to type.

The execution layer

Inference Fabric

The GPU execution platform. It turns that continuous work into saturated NVIDIA silicon, on hardware you own.