LLM Pricing Explained: OpenAI vs Gemini vs Open-Source Models

How LLM Pricing Really Works: Token Economics vs Compute Time
When enterprises start scaling Generative AI, the first surprise often comes from the bill. On paper, closed LLM APIs like OpenAI’s GPT-5 or Google’s Gemini 2.5 Flash look affordable: you pay per token. But once you push into millions of documents or images, token economics can balloon into six-figure monthly invoices.
By contrast, self-hosting open-source models works differently. Instead of paying per token, you pay for compute time. This means your costs are tied to how long it takes the GPU to process your workload. At small volumes, it may feel more complex and infrastructure-heavy. But at scale, the economics flip: self-hosting often turns out to be far cheaper and more predictable.
Let’s break down the cost dynamics with a practical example.
The Case Study: Extracting Data from 1 Million Documents
Imagine you are executing 1,000,000 inferences, each with about 10,000 input tokens and 3,000 output tokens. This is representative of workloads such as unstructured data extraction, where models must interpret long PDFs, process scanned images, or parse detailed forms into structured fields.
Cost of Inference with Closed-Source Models
Closed APIs charge separately for input and output tokens. Here’s how it looks when applied to our use case:
| Model | Input Tokens ($/1M) | Output Tokens ($/1M) | Cost per Document | Total Cost (1M docs) |
|---|---|---|---|---|
| gpt-5 | $1.25 | $10.00 | $0.0425 | $42,500 |
| gemini-2.5-flash | $0.30 | $2.50 | $0.0105 | $10,500 |
The math is straightforward: multiply token consumption by cost per million tokens. For GPT-5, each inference costs about 4 cents, which sounds tiny until you multiply by a million. Gemini is cheaper, but even then the total runs into five figures.
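As a sanity check, here is the same arithmetic as a minimal Python sketch, using the per-million-token rates from the table above:

```python
# Per-million-token API prices (USD), taken from the table above.
PRICES = {
    "gpt-5": {"input": 1.25, "output": 10.00},
    "gemini-2.5-flash": {"input": 0.30, "output": 2.50},
}

INPUT_TOKENS = 10_000    # input tokens per document
OUTPUT_TOKENS = 3_000    # output tokens per document
NUM_DOCS = 1_000_000

for model, rate in PRICES.items():
    per_doc = (INPUT_TOKENS * rate["input"] + OUTPUT_TOKENS * rate["output"]) / 1_000_000
    print(f"{model}: ${per_doc:.4f}/doc -> ${per_doc * NUM_DOCS:,.0f} total")
# gpt-5: $0.0425/doc -> $42,500 total
# gemini-2.5-flash: $0.0105/doc -> $10,500 total
```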
Cost of Inference with an Open-Source Model
- Inference platform: Runpod
- GPU: NVIDIA A40
- Generation speed: 45 output tokens/sec, assuming a fine-tuned 14B model
- Time per document: ~66 sec (3,000 output tokens ÷ 45 tokens/sec), i.e. ~0.018 hr
- Total inference time: ~18,518 GPU-hrs
- GPU cost: $0.40/hr
- Total cost: ~$7,407
Here, your total bill is under $8K, a fraction of the cost of GPT-5 and still meaningfully cheaper than Gemini 2.5 Flash.
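The same calculation in Python, as a minimal sketch that assumes serial processing on a single GPU. In practice you would shard the workload across many A40s, which shortens wall-clock time but leaves the total billed GPU-hours unchanged:

```python
OUTPUT_TOKENS = 3_000     # tokens generated per document
NUM_DOCS = 1_000_000
TOKENS_PER_SEC = 45       # assumed throughput of a fine-tuned 14B model on an A40
GPU_COST_PER_HR = 0.40    # Runpod on-demand rate used in this example

seconds_per_doc = OUTPUT_TOKENS / TOKENS_PER_SEC        # ~66.7 sec
total_gpu_hours = NUM_DOCS * seconds_per_doc / 3_600    # ~18,518 hrs
total_cost = total_gpu_hours * GPU_COST_PER_HR

print(f"{seconds_per_doc:.1f} sec/doc, {total_gpu_hours:,.0f} GPU-hrs, ${total_cost:,.0f}")
# 66.7 sec/doc, 18,519 GPU-hrs, $7,407
```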
Cost Savings
- Open source vs GPT-5: ~5.7× cheaper
- Open source vs Gemini 2.5 Flash: ~1.4× cheaper
This simple comparison shows how the economics shift dramatically as volumes rise.
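Or, computed directly from the totals above:

```python
print(f"vs GPT-5:            {42_500 / 7_407:.1f}x cheaper")  # 5.7x
print(f"vs Gemini 2.5 Flash: {10_500 / 7_407:.1f}x cheaper")  # 1.4x
```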
Why Enterprises Care About Data Residency and Privacy
While cost is the most visible factor, it’s not the only reason enterprises are pivoting to open-source. The regulatory landscape is tightening, and compliance is now a board-level issue:
- EU AI Act (effective Aug 2025): Classifies certain AI use cases as “high-risk” with non-compliance fines up to €35M or 7% of global turnover.
- India’s Digital Personal Data Protection Act (2025): Requires strict data localisation and explicit consent for cross-border transfers.
- 71% of CIOs already cite data privacy as the single biggest blocker to scaling GenAI initiatives (Deloitte GenAI Survey 2025).
When you send data to a closed API, you don’t control where it’s stored, who has access, or how it’s logged. Even if the provider is compliant, your organisation may be liable. In industries like healthcare, finance, or government, that risk is unacceptable.
With a self-hosted model, all data stays within your perimeter—on your cloud, your servers, or even on-prem infrastructure. This isn’t just a compliance checkbox; it reassures customers and regulators that sensitive data never leaves your control.
The Hidden Economics of Fine-Tuning
Another overlooked area is fine-tuning. Many enterprises want models customised for their terminology, domain, or workflows.
- Closed models: Fine-tuning options are controlled by the vendor. Prices are opaque, and in some cases features are discontinued (e.g., when OpenAI removed fine-tuning for certain GPT versions). This locks you into their ecosystem.
- Open models: Fine-tuning requires upfront compute investment but gives you full ownership of the weights. Once trained, the fine-tuned model can be reused across business units and workloads. Over time, this dramatically lowers marginal cost.
For CFOs, this means closed models are an ongoing operating expense with little reuse value, while open models are more like capital expenditure with long-term returns.
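To make the opex-vs-capex framing concrete, here is a rough break-even sketch. The $5,000 fine-tuning spend is a placeholder assumption, not a quote; the per-document costs come from the earlier example:

```python
# Illustrative assumptions -- substitute your own numbers.
FINETUNE_COST = 5_000          # assumed one-off GPU spend to fine-tune an open model
OPEN_COST_PER_DOC = 0.0074     # ~$7,407 / 1M docs, from the example above
CLOSED_COST_PER_DOC = 0.0425   # GPT-5 rate, from the table above

# Volume at which the one-off fine-tuning spend pays for itself.
break_even_docs = FINETUNE_COST / (CLOSED_COST_PER_DOC - OPEN_COST_PER_DOC)
print(f"Break-even at ~{break_even_docs:,.0f} documents")
# Break-even at ~142,450 documents
```

Under these assumptions, the one-off spend pays for itself well before 150,000 documents; everything after that point widens the gap.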
When Closed-Source APIs Still Make Sense
Despite the strong cost and compliance case for open-source, closed APIs retain clear advantages in certain situations:
- Low-volume workloads: If you only need to process a few thousand documents per month, the absolute cost remains small, and the simplicity of APIs outweighs the complexity of running your own infra.
- Rapid experimentation: If you are still in the proof-of-concept phase, closed APIs are faster. You can spin up an experiment in minutes without worrying about infrastructure, GPUs, or fine-tuning pipelines.
- Advanced multimodal capabilities: Closed APIs often ship cutting-edge features first, such as high-quality video or speech integration, which may not yet be available in open models.
In other words, closed APIs are excellent for speed and innovation at small scale. But as soon as workloads scale into millions of inferences, the economic and compliance calculus changes.
The Economics Flip at Scale
For large-scale, production workloads, self-hosted open-source models are usually more cost-efficient, and easier to keep compliant, than closed APIs. The savings compound over time, especially as workloads grow and regulatory penalties become more severe.
For business leaders, the key question is no longer “What does this API cost per 1,000 tokens?” Instead, it’s “What does it cost to run my entire business on this model: financially, legally, and operationally?”
If you are planning to scale AI across millions of documents, run the numbers yourself. Compare token-based API pricing with compute-time costs of open-source. Don’t just consider today’s cost; model out three years of usage, regulatory risk, and fine-tuning strategy. In many cases, the path to lower bills and stronger compliance will lead you toward sovereign, open-source AI stacks.
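As a starting point, here is a deliberately simple three-year projection. The 50% annual growth rate and the one-off fine-tuning spend are placeholder assumptions; swap in your own volumes and negotiated prices:

```python
# Placeholder assumptions -- replace with your own numbers.
DOCS_YEAR_1 = 1_000_000
ANNUAL_GROWTH = 1.5            # assumed 50% volume growth per year
CLOSED_COST_PER_DOC = 0.0425   # GPT-5 rate from the example above
OPEN_COST_PER_DOC = 0.0074     # A40 self-hosting rate from the example above
FINETUNE_COST = 5_000          # assumed one-off fine-tuning spend, year 1 only

closed_total, open_total = 0.0, float(FINETUNE_COST)
docs = DOCS_YEAR_1
for year in (1, 2, 3):
    closed_total += docs * CLOSED_COST_PER_DOC
    open_total += docs * OPEN_COST_PER_DOC
    print(f"Year {year} cumulative: closed ${closed_total:,.0f} vs open ${open_total:,.0f}")
    docs *= ANNUAL_GROWTH
```

Even under these conservative assumptions, the gap widens every year; that compounding is exactly what a per-token invoice hides.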