When does self-hosting an LLM become cheaper than an API?

Self-hosting is a mostly fixed cost — you pay for the GPUs whether they're busy or idle — while an API is purely variable, charged per token. Below the break-even volume the API wins because you only pay for what you use. Above it, the fixed GPU cost spreads over enough requests that self-hosting's cost per request drops below the API price. The break-even volume is the GPU monthly cost divided by the API cost per request.

What costs am I missing with self-hosting?

This compares raw GPU rental to API token cost. Real self-hosting also carries engineering time to set up and maintain serving, autoscaling and redundancy, idle capacity for traffic spikes, model and infrastructure updates, and on-call. Those make the true break-even higher than the pure-compute number, so treat this as the optimistic case for self-hosting.

Is cost the only thing that matters?

No. Data residency and privacy, latency control, avoiding vendor rate limits, and the ability to fine-tune can all justify self-hosting even below break-even. Equally, an API removes operational burden entirely. Use this as the cost input to a broader decision, not the whole decision.

Self-Host vs API Calculator

API baseline

Comparable API model

Input tokens / request

Output tokens / request

Self-host setup

GPU cost $ / hour each

GPUs

Throughput req / sec, full load

Utilization % busy

Your volume requests / month

USD. API prices reviewed —. GPU rent is the optimistic case — real self-hosting adds ops overhead.

The crossover

Break-even volume —

API / month

—

at your volume

Self-host / month

—

fixed GPU cost

Cheaper now

—

saving

GPU capacity / mo

—

requests

API scales with volume; self-host is flat. They cross at break-even.

Fixed vs variable

An API charges only for what you use, so it wins at low and spiky volume. GPUs cost the same idle or busy, so they only pay off once you keep them genuinely busy.

Utilization is everything

A GPU at 20% utilization costs five times as much per request as one at 100%. Spiky traffic that needs headroom for peaks is exactly where self-hosting economics get hard.

This is the optimistic case

Raw GPU rent ignores setup, redundancy, on-call, and updates. The real break-even sits higher than the pure-compute number — treat this as the floor, not the verdict.

About self-hosting vs API

When does self-hosting get cheaper?

Self-hosting is a fixed cost; an API is variable. Below the break-even volume the API wins; above it the fixed GPU cost spreads over enough requests to beat the per-token price. Break-even = GPU monthly cost ÷ API cost per request.

What costs am I missing?

This compares raw GPU rent to API tokens. Real self-hosting also means setup and maintenance, redundancy, idle headroom for spikes, updates, and on-call — so the true break-even is higher. Treat this as the optimistic case.

What throughput should I assume?

It depends on model size, GPU, batching, and request length. Use a measured number if you have one, and keep utilization realistic — sustained high utilization is hard with spiky traffic.

Is cost the only factor?

No — data residency, latency, rate limits, and fine-tuning can justify self-hosting below break-even, while an API removes ops burden entirely. Use this as one input, not the whole decision.

Is my data stored anywhere?

No. Everything runs in your browser; nothing is sent to a server.

Self-host vs API calculator

API baseline

Self-host setup

The crossover

About self-hosting vs API

Get new tools as they ship