Does token count differ for images or multimodal inputs?

Yes. Images are converted to a token count based on size. At OpenAI, a standard image costs 765 tokens using GPT-4o. At Anthropic, approximately 750-1,500 tokens per image. At Google, a standard image under 384x384 counts as 258 tokens. These are added to input token totals and billed at input pricing.

How do retries affect real token usage vs estimated usage?

Every retry is charged as a new API call with its full token count. At a 10% retry rate, actual token usage is 10% higher than the successful-request count suggests. In agent workflows, a failure at step 8 of 10 that restarts from step 1 costs 18 steps instead of 10 -- an 80% increase (a 1.8x multiplier) on that run. Add 10-20% to token estimates for retry overhead.
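
The retry arithmetic can be sketched in a few lines. This is an illustrative model rather than a billing formula: the function names are hypothetical, and it assumes each retried call is resent once in full and that a failed agent run restarts from step 1.

```python
def tokens_with_retries(planned_tokens: float, retry_rate: float) -> float:
    """Expected billed tokens when a fraction of calls is retried once in full."""
    return planned_tokens * (1 + retry_rate)


def agent_steps_billed(total_steps: int, failed_at_step: int) -> int:
    """Steps billed when a run fails at `failed_at_step` and restarts from step 1."""
    return failed_at_step + total_steps


if __name__ == "__main__":
    steps = agent_steps_billed(total_steps=10, failed_at_step=8)
    print(steps, "steps billed,", steps / 10, "x the planned run")
    print(round(tokens_with_retries(1_000_000, 0.10)), "tokens billed")
```

Checkpointing an agent so it resumes from the failed step instead of step 1 removes most of that overhead.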

What is the cost difference between RAG vs fine-tuning for reducing token costs?

RAG adds tokens per request (retrieved documents) but avoids fine-tuning costs and keeps knowledge current. Fine-tuning reduces tokens per request but costs $8-$25 per million training tokens upfront. For most applications, well-designed RAG with context caching is more cost-effective than fine-tuning, especially for dynamic knowledge bases.

Does context window size affect cost when I do not use the full window?

You only pay for tokens you actually include in a request, not for available context window size. A 1M token context window costs the same as a 128K window for a 5,000-token request. The larger window's value is having the option to include more content when needed without switching to chunking.

How accurate are published token pricing estimates for real workloads?

Published token prices are accurate; estimated usage is usually wrong. Most teams underestimate total token consumption by 2-3x because they do not account for system prompt overhead on every call, context accumulation across conversation turns, RAG retrieved content tokens, and retry overhead. Measure actual token usage on real requests before building a cost model.

Published: May 26, 2026 · Updated: May 27, 2026 · 18 min read · AI Cost

Cost Per Token Explained: GPT vs Claude vs Gemini (2026)

Q: Why do token prices vary so much between models?

Token prices reflect inference cost, competitive positioning, and provider economics. Larger models require more compute per forward pass and cost more to serve. Budget models use architecture optimizations that reduce inference cost. Pricing also reflects market dynamics, with some providers pricing below cost to gain market share.

Q: How does tokenization work differently across GPT, Claude, and Gemini?

Each model uses its own tokenizer vocabulary. For English text, all three produce token counts within 10-15% of each other. Differences are more pronounced for code, non-English languages, and special characters. For accurate cost estimation, measure with the specific model's tokenizer rather than using a generic estimate.

Q: How do I reduce token costs without changing the model?

The four highest-impact changes: implement prompt caching for repeated context (saves 30-70% on input), trim conversation history to last 5-8 turns with a summary (saves 20-40%), reduce system prompt size by removing redundant instructions (saves 10-20%), and add explicit output length constraints (saves 15-30% on output). Together these can reduce total costs by 40-60%.

Q: How does batch processing reduce token costs?

Batch processing sends requests asynchronously, returns results within 24 hours, and bills at lower per-token rates. OpenAI's Batch API costs 50% of standard pricing. The token count does not change; you pay for the same tokens at a lower price. Applicable workloads include bulk analysis, nightly processing, large document summarization, and classification runs.

Q: How do I accurately estimate output token count before deploying?

Run your prompt on 50-100 representative real inputs and measure actual output token counts. Add explicit length constraints and measure the effect. Model the 90th percentile output length, not just the median -- long-tail responses drive average costs up significantly. Do not estimate based on your desired output length.
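
A minimal sketch of that measurement step, using only the standard library. The sample values are hypothetical stand-ins for output token counts you would log from real requests, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math
import statistics


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical output token counts measured on 10 sample requests.
measured = [180, 190, 200, 210, 220, 230, 240, 260, 300, 900]

median = statistics.median(measured)
mean = statistics.mean(measured)
p90 = percentile(measured, 90)

if __name__ == "__main__":
    print(f"median={median} mean={mean} p90={p90}")
```

In this sample a single 900-token response pulls the mean well above the median, which is the long-tail effect described above.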

Cost Per Token Explained 2026: GPT vs Claude vs Gemini

Token pricing is the part of AI API documentation that confuses more people than any other. The numbers are small ($0.15 per million tokens), which makes them feel unimportant until the invoice arrives and the math turns out to be less friendly than expected.

The confusion usually comes from two places. First, most people do not have a clear sense of how many tokens their actual usage generates. “A million tokens” sounds like a lot until you realize that a moderately busy customer support chatbot can burn through several million tokens in a day. Second, the price comparison tables you find online present raw token prices without any context for what those tokens actually produce. $0.075/MTok sounds cheaper than $3.00/MTok until you realize one model requires five calls to complete a task that the other handles in one.

This guide gives you the complete picture: what tokens are, how they are priced, how much GPT, Claude, and Gemini charge at current rates, and how to calculate what your specific application will actually cost.

Quick Answer

If you only want the answer:

Model	Input cost	Output cost
Gemini Flash-Lite	Lowest	Lowest
Gemini Flash	Very Low	Very Low
GPT-4o Mini	Very Low	Very Low
Claude Haiku	Medium	Medium
GPT-4o	High	High
Claude Sonnet	Highest	Highest

For most businesses, Gemini Flash currently delivers the best cost-per-token value.

Key Takeaways

✓One token is roughly 0.75 words in English, or about 4 characters; "token" is not a word-count unit
✓Output tokens cost 4-5x as much as input tokens for every model priced in this guide; most cost estimates underweight this
✓Gemini 2.5 Flash-Lite is the cheapest production-capable model in 2026 at $0.075 input / $0.30 output per million tokens
✓Claude 3.5 Sonnet is the most expensive mid-tier model at $3.00 input / $15.00 output per million tokens
✓Prompt caching can reduce input token costs by 50-90% for applications with repeated context
✓The cheapest model per token is not the cheapest model per task; output quality and retry rate affect the real cost per useful result
✓Most teams underestimate token usage by 2-3x because they do not account for context accumulation, system prompt size, and RAG retrieved content
✓Batch API options at OpenAI (50% cheaper) and Anthropic apply to asynchronous workloads and can halve costs on non-real-time tasks

On This Page

1. What is a token?
2. Why AI companies charge per token
3. How much does one million tokens cost?
4. GPT token pricing explained
5. Claude token pricing explained
6. Gemini token pricing explained
7. GPT vs Claude vs Gemini token cost comparison
8. Real cost scenario 1: customer support chatbot
9. Real cost scenario 2: AI content generation
10. Real cost scenario 3: AI agent workflow
11. Why businesses miscalculate token costs
12. Cost per useful output
13. How to reduce token costs
14. Token cost calculator framework
15. Which model offers the best token value?
16. One-minute token cost audit
17. Quick answers
18. Frequently asked questions

What is a token?

Direct Answer

A token is a chunk of text that an AI language model reads and generates. In English, one token is roughly 0.75 words or about 4 characters. “Token” is a technical unit specific to how models process text, not a word count or character count.

Language models do not read text the way humans do, word by word. They process text as sequences of tokens, where tokens are the result of splitting text according to a vocabulary that the model was trained with. Common short words are single tokens. Longer or rare words often split into multiple tokens.

Text example	Words	Approximate tokens
"Hello"	1	1
"Hello, how are you?"	4	6
"API"	1	1
"anthropomorphization"	1	4
"The quick brown fox jumps"	5	6
A 500-word blog paragraph	500	~667
A 1,000-word article	1,000	~1,333
A 10-page PDF document	~5,000	~6,700
GPT-4o's context window (128K)	~96,000 words	128,000
Gemini's context window (1M)	~750,000 words	1,000,000

Visual comparison showing how words break into tokens for different text types across AI models

The practical conversion rate:

✓1,000 tokens is approximately 750 words
✓1 million tokens is approximately 750,000 words -- a bit over two-thirds of the combined length of the seven Harry Potter novels (about 1.08 million words)
✓A typical customer support exchange (question + answer): 300-800 tokens
✓A short blog post (800 words): approximately 1,067 tokens
✓A detailed GPT system prompt: 500-2,000 tokens

Token counting is slightly different across providers because different models use different vocabularies. GPT, Claude, and Gemini each have their own tokenization scheme. In practice, the differences are small (within 10-15%) for standard English text. For code, non-English languages, or heavy use of special characters, token counts can vary more significantly.
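
For quick budgeting before you reach for a real tokenizer, the rules of thumb above can be written as two small helpers. This is only a heuristic sketch: the function names are hypothetical, the 0.75 words-per-token and 4 characters-per-token ratios are the English-text approximations from this guide, and neither helper reproduces any provider's actual tokenizer.

```python
def estimate_tokens_from_words(word_count: int) -> int:
    """Rough English estimate: about 0.75 words per token."""
    return round(word_count / 0.75)


def estimate_tokens_from_chars(text: str) -> int:
    """Rough English estimate: about 4 characters per token."""
    return max(1, round(len(text) / 4))


if __name__ == "__main__":
    print(estimate_tokens_from_words(1_000))   # 1333
    print(estimate_tokens_from_words(800))     # 1067
    print(estimate_tokens_from_chars("The quick brown fox jumps"))  # 6
```

Expect these estimates to drift further from reality for code and non-English text, where measuring with the specific model's tokenizer matters most.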

Use the Vortenza AI Token Counter to measure the exact token count for your specific prompts across different models before estimating costs.

Why AI companies charge per token

Direct Answer

AI companies charge per token because tokens directly map to the compute they use. Every token your model reads or generates requires the model to process that token through its neural network layers, which uses GPU time and memory. You pay for what you use.

The cost of running an inference (generating an AI response) scales with the number of tokens involved. A 100-token response uses roughly half the compute of a 200-token response on the same model. So per-token pricing is the most accurate way to bill for actual compute consumption.

Why output costs more than input

Output tokens (tokens the model generates) cost significantly more than input tokens (tokens in your prompt). The reason is that generating each output token requires the model to run a full forward pass through the neural network -- it is computationally intensive. Reading input tokens can be done more efficiently through parallel processing.

At GPT-4o: input is $2.50/MTok, output is $10.00/MTok -- a 4x difference. At Claude 3.5 Sonnet: input is $3.00/MTok, output is $15.00/MTok -- a 5x difference. This asymmetry is why output-heavy applications (content generation, long-form answers) cost much more per interaction than input-heavy applications (classification, short Q&A).

Chart showing the input-to-output token cost ratio across GPT-4o, Claude Sonnet, and Gemini Flash models

Context windows and cost

Every token in your context window costs money on every request. A system prompt of 1,000 tokens gets charged as input on every API call. If you make 100,000 requests per month, that system prompt alone adds up to 100 million input tokens. At Gemini Flash pricing ($0.15/MTok), that is $15/month just for the system prompt. At Claude Sonnet pricing ($3.00/MTok), it is $300/month. Prompt caching cuts the cost of that repeated content by 50-90%, depending on the provider.
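
The system prompt overhead is easy to check with a short calculation. The sketch below assumes the prompt is sent uncached on every request; the function name is hypothetical and the prices are the per-million-token rates quoted in this guide.

```python
def monthly_prompt_overhead(prompt_tokens: int, requests_per_month: int,
                            input_price_per_mtok: float) -> float:
    """USD per month spent resending the same prompt on every request."""
    total_tokens = prompt_tokens * requests_per_month
    return total_tokens / 1_000_000 * input_price_per_mtok


if __name__ == "__main__":
    for label, price in [("Gemini Flash", 0.15), ("Claude Sonnet", 3.00)]:
        cost = monthly_prompt_overhead(1_000, 100_000, price)
        print(f"{label}: ${cost:,.2f}/month")
```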

How much does one million tokens cost?

One million tokens is the standard pricing unit across all major AI providers. Here is what each model charges as of June 2026.

Direct Answer

Prices range from $0.075 per million input tokens (Gemini 2.5 Flash-Lite) to $75.00 per million output tokens (Claude 3 Opus). For typical applications, expect $0.15-$3.00 per million input tokens and $0.60-$15.00 per million output tokens.

Model	Input / MTok	Output / MTok	Context window	Best use case
GPT-5	$5.00+	$20.00+	128K+	Most demanding tasks
GPT-4o	$2.50	$10.00	128K	Complex reasoning, multimodal
GPT-4o mini	$0.15	$0.60	128K	High-volume simple tasks
Claude 3.5 Sonnet	$3.00	$15.00	200K	Coding, complex writing, analysis
Claude 3.5 Haiku	$0.80	$4.00	200K	Production chatbots, mid-tier tasks
Claude 3 Opus	$15.00	$75.00	200K	Research, highest capability
Gemini 2.5 Pro	$1.25	$5.00	1M	Long documents, reasoning
Gemini 2.5 Flash	$0.15	$0.60	1M	Most production applications
Gemini 2.5 Flash-Lite	$0.075	$0.30	1M	Bulk processing, classification
DeepSeek V3	$0.27	$1.10	64K	Cost-sensitive tasks

What does $1 of tokens buy you?

✓At Gemini 2.5 Flash ($0.15 input / $0.60 output): $1 buys approximately 6.7 million input tokens or 1.7 million output tokens.
✓At Claude 3.5 Sonnet ($3.00 input / $15.00 output): $1 buys approximately 333,000 input tokens or 67,000 output tokens.
✓At Sonnet pricing, you get about 20x fewer input tokens and 25x fewer output tokens per dollar than at Flash pricing.

GPT token pricing explained

Direct Answer

OpenAI charges $2.50 per million input tokens and $10.00 per million output tokens for GPT-4o. GPT-4o mini costs $0.15 input / $0.60 output. Both support 128K context windows. Cached input tokens cost 50% less for eligible requests.

GPT-4o pricing

GPT-4o is OpenAI's primary production model. At $2.50/$10.00 per million tokens, it sits in the mid-premium tier. The 4x output multiplier ($10.00 vs $2.50) means output-heavy applications cost proportionally more.

OpenAI automatically caches input tokens for requests where the same prompt prefix has been seen recently. Cached input tokens cost $1.25/MTok (50% off). This applies automatically -- you do not need to implement caching manually. The practical effect: for applications that repeatedly use the same long system prompt, input costs are often lower than the stated $2.50/MTok.

GPT-4o mini pricing

Mini sits at $0.15 input / $0.60 output -- identical pricing to Gemini 2.5 Flash. For OpenAI ecosystem users who want low cost, mini is the natural choice. Tool and function calling implementation in mini is mature and well-tested.

Batch API pricing

OpenAI's Batch API processes requests asynchronously (results within 24 hours) at 50% off standard pricing. GPT-4o via batch costs $1.25 input / $5.00 output per million tokens. For non-real-time workloads (overnight processing, bulk analysis), batch pricing effectively makes GPT-4o comparable to mid-tier models.

Example cost calculations:

1,000 tokens input + 500 tokens output (typical chatbot exchange):

GPT-4o: $0.0025 + $0.005 = $0.0075
GPT-4o mini: $0.00015 + $0.0003 = $0.00045

At 100,000 conversations/month (2,000 input + 1,500 output tokens each):

GPT-4o: $500 input + $1,500 output = $2,000/month
GPT-4o mini: $30 input + $90 output = $120/month
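
The per-request arithmetic can be wrapped in a small helper. This is a sketch, not OpenAI's billing logic: the `PRICES` table and `request_cost` function are hypothetical names, the rates are the ones listed in this guide, and the `batch` flag simply applies the 50% Batch API discount described above.

```python
# USD per million tokens (input, output), as listed in this guide.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 batch: bool = False) -> float:
    """Cost in USD of one request; batch=True halves it (Batch API discount)."""
    input_price, output_price = PRICES[model]
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return cost * 0.5 if batch else cost


if __name__ == "__main__":
    for model in PRICES:
        # Typical chatbot exchange: 1,000 input + 500 output tokens.
        print(model, f"${request_cost(model, 1_000, 500):.5f}")
```

Multiplying the per-request figure by monthly request volume gives the monthly estimate; keeping the price table in one place makes it easy to update when rates change.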

See the full OpenAI pricing breakdown in the OpenAI API Pricing 2026 guide.

Claude token pricing explained

Direct Answer

Anthropic charges $0.80 input / $4.00 output per million tokens for Claude 3.5 Haiku, and $3.00 input / $15.00 output for Claude 3.5 Sonnet. Both support 200K context windows. Prompt caching reduces cached input costs to $0.08/MTok for Haiku and $0.30/MTok for Sonnet -- a 90% discount on cached content.

Claude 3.5 Haiku

Haiku is Anthropic's cost-optimized model. At $0.80/$4.00, it is roughly 5x more expensive on input than Gemini Flash but noticeably better at following complex, multi-part instructions. For businesses where chatbot quality directly affects outcomes (sales conversations, nuanced customer support), Haiku's quality edge over Flash often justifies the premium.

Claude 3.5 Sonnet

Sonnet is the expensive option in the mid-tier. $3.00/$15.00 makes it the most expensive model outside the frontier tier (GPT-5, Claude Opus) for typical production use. The premium is justified when output quality is critical enough that errors are costly: production code generation, customer-facing writing that gets used without editing, complex multi-step analysis.

Anthropic's prompt caching

Anthropic's caching is the most aggressive of any major provider. Cached reads on Haiku cost $0.08/MTok (90% off standard input). Cached reads on Sonnet cost $0.30/MTok (also 90% off). Cache writes cost 25% more than standard input (one-time cost to establish the cache). The minimum cacheable block is 1,024 tokens. For applications with large repeated system prompts or document contexts, this brings input costs down dramatically.

Example cost calculations:

A Claude Sonnet chatbot handling 100K monthly conversations (3,200 input tokens per conversation, 500 cached from system prompt, 2,700 not cached, 2,400 output tokens):

Cached input: 50M tokens x $0.30/MTok = $15
Non-cached input: 270M tokens x $3.00/MTok = $810
Output: 240M tokens x $15.00/MTok = $3,600
Total: $4,425/month (vs $4,560 without caching)

For an application where 80% of input is repeated system context: caching reduces input cost by roughly 72%, changing the calculation significantly.
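
The effect of the cached share on input cost can be sketched as a blended-rate calculation. This simplified model is an assumption-laden illustration: the function name is hypothetical, and it ignores the 25% cache-write surcharge and cache expiry, so real savings will be somewhat lower.

```python
def input_cost_with_cache(total_input_mtok: float, cached_fraction: float,
                          input_price: float, cache_read_price: float) -> float:
    """Monthly input cost in USD when a fraction of input is served from cache."""
    cached = total_input_mtok * cached_fraction
    uncached = total_input_mtok - cached
    return cached * cache_read_price + uncached * input_price


if __name__ == "__main__":
    total = 320  # million input tokens per month (100K conversations x 3,200)
    baseline = input_cost_with_cache(total, 0.0, 3.00, 0.30)
    small_cache = input_cost_with_cache(total, 500 / 3_200, 3.00, 0.30)
    large_cache = input_cost_with_cache(total, 0.80, 3.00, 0.30)
    print(f"no cache:   ${baseline:,.2f}")
    print(f"500 cached: ${small_cache:,.2f}")
    print(f"80% cached: ${large_cache:,.2f} "
          f"({1 - large_cache / baseline:.0%} lower)")
```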

See full details in the Claude API Pricing 2026 guide.

Gemini token pricing explained

Direct Answer

Google charges $0.075 input / $0.30 output for Gemini 2.5 Flash-Lite, $0.15/$0.60 for Gemini 2.5 Flash, and $1.25/$5.00 for Gemini 2.5 Pro (under 200K context). All three models share a 1M token context window. Google offers the most generous free tier of any major provider.

Gemini 2.5 Flash-Lite

Flash-Lite is the cheapest capable model from any major provider: $0.075/$0.30 per million tokens. Half the cost of Flash. The 1M token context window at this price is unusual -- most budget-tier models have smaller context windows. Flash-Lite handles classification, structured extraction, and simple Q&A well. It produces more errors on complex reasoning and nuanced instruction following.

Gemini 2.5 Flash

Flash at $0.15/$0.60 is the best-value production model in 2026 for most workloads. Same 1M context window as Pro at one-eighth the cost. Suitable for customer support, content generation, RAG applications, and most agentic use cases.

Gemini 2.5 Pro

Pro at $1.25/$5.00 (under 200K context) or $2.50/$10.00 (over 200K context) is appropriate for complex reasoning, advanced code generation, and long-document analysis. The context surcharge above 200K is an important nuance: if your application regularly passes prompts above 200K tokens, input cost doubles.
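
The two-tier Pro pricing can be modeled with a simple threshold check. Treat this as a sketch under stated assumptions: the function name is hypothetical, and it assumes the higher rate applies to the whole request once the prompt exceeds 200,000 tokens and that a prompt of exactly 200,000 tokens stays in the lower tier. Confirm both details against Google's current pricing page.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # prompt tokens; boundary handling is an assumption


def gemini_pro_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one Gemini 2.5 Pro request under the tiered rates in this guide."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_price, output_price = 2.50, 10.00
    else:
        input_price, output_price = 1.25, 5.00
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    print(f"${gemini_pro_request_cost(100_000, 1_000):.2f}")  # standard tier
    print(f"${gemini_pro_request_cost(300_000, 1_000):.2f}")  # long-context tier
```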

Context caching

Gemini's caching requires a minimum of 32,768 tokens (32K) to activate. Cached storage costs $1.00/MTok/hour. Cached reads cost approximately 25% of standard input pricing. For large repeated contexts (knowledge bases, long documents), caching can reduce input costs by up to 75%, before storage fees.

Free tier

Gemini 2.5 Flash offers 15 requests per minute and 1 million tokens per day at no cost. For side projects, internal tools, and early MVPs, the free tier often covers all usage.

Example cost calculations:

Same chatbot (100K conversations, 3,200 input + 2,400 output):

Gemini Flash-Lite: $24 input + $72 output = $96/month
Gemini Flash: $48 input + $144 output = $192/month
Gemini Pro: $400 input + $1,200 output = $1,600/month

See full Gemini pricing in the Gemini API Pricing 2026 guide.

GPT vs Claude vs Gemini token cost comparison

Feature	GPT-4o	GPT-4o mini	Claude 3.5 Sonnet	Claude 3.5 Haiku	Gemini Flash	Gemini Flash-Lite
Input / MTok	$2.50	$0.15	$3.00	$0.80	$0.15	$0.075
Output / MTok	$10.00	$0.60	$15.00	$4.00	$0.60	$0.30
Context window	128K	128K	200K	200K	1M	1M
Caching discount	50% auto	50% auto	90% (explicit)	90% (explicit)	75% (explicit)	75% (explicit)
Free tier	Minimal	Minimal	No	No	Yes (1M tokens/day)	Yes (most generous)
Reasoning quality	Very Good	Good	Excellent	Good	Good	Fair
Instruction following	Very Good	Good	Excellent	Very Good	Good	Fair-Good
Speed	Medium-Fast	Fast	Medium	Fast	Fast	Fastest
Best for	Broad capability	High-volume simple	Complex quality work	Quality production	Most apps	Bulk processing

The context window gap is significant

GPT-4o and mini max out at 128K tokens. Claude models support 200K tokens. Gemini models support 1 million tokens. For applications that need to process long documents without chunking, or maintain very long conversation histories, Gemini's context advantage is real and has cost implications beyond the per-token price. A 500K-token document processed in one Gemini Flash call costs about $0.075 in input tokens. Processed through multiple GPT-4o calls with chunking and aggregation, the per-token cost and engineering overhead are both higher.

The caching difference matters for production

Anthropic's 90% discount on cached reads is the most aggressive caching discount available. For applications where 50%+ of input tokens are repeated context (common in agentic systems, RAG applications, and chatbots with large knowledge bases), the effective input cost on Haiku drops to roughly $0.08-$0.44/MTok blended: $0.44 at 50% cached, falling toward $0.08 as the cached share nears 100%. At about 90% cached, the blended rate is roughly $0.15/MTok, on par with Flash pricing.

Real cost scenario 1: customer support chatbot

Setup: 100,000 conversations per month. Average 8-turn conversation: 400 tokens input + 300 tokens output per turn = 3,200 input + 2,400 output tokens per conversation. Monthly tokens: 320M input + 240M output.

Model	Input cost	Output cost	Total/month	Annual
Gemini Flash-Lite	$24	$72	$96	$1,152
Gemini Flash / GPT-4o mini	$48	$144	$192	$2,304
Claude 3.5 Haiku	$256	$960	$1,216	$14,592
Gemini 2.5 Pro	$400	$1,200	$1,600	$19,200
GPT-4o	$800	$2,400	$3,200	$38,400
Claude 3.5 Sonnet	$960	$3,600	$4,560	$54,720

With Anthropic caching on Haiku (500-token system prompt per conversation): 50M tokens at $0.08/MTok + 270M at $0.80/MTok = $220/month vs $256 without caching. The savings are modest here because the system prompt is a small share of total input. For applications with larger cached context, the savings are substantial.
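
The scenario 1 table can be reproduced from the setup numbers with a short script. It is a sketch: the `PRICES` dictionary and function names are hypothetical, the rates are those listed in this guide, and caching, retries, and context accumulation are deliberately left out.

```python
# USD per million tokens (input, output), as listed in this guide.
PRICES = {
    "Gemini Flash-Lite": (0.075, 0.30),
    "Gemini Flash / GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Gemini 2.5 Pro": (1.25, 5.00),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def scenario_table(conversations: int, input_per_conv: int,
                   output_per_conv: int) -> dict[str, float]:
    """Monthly cost in USD per model for a fixed per-conversation token budget."""
    input_mtok = conversations * input_per_conv / 1_000_000
    output_mtok = conversations * output_per_conv / 1_000_000
    return {
        model: round(input_mtok * in_price + output_mtok * out_price, 2)
        for model, (in_price, out_price) in PRICES.items()
    }


if __name__ == "__main__":
    for model, monthly in scenario_table(100_000, 3_200, 2_400).items():
        print(f"{model:28s} ${monthly:>9,.2f}/month  ${monthly * 12:>10,.2f}/year")
```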

Real cost scenario 2: AI content generation

Setup: 50 blog posts per day (1,500/month). Each post: 1,500 tokens input + 3,500 tokens output. Monthly tokens: 2.25M input + 5.25M output. Output-dominated workload: output tokens are 70% of the token volume and roughly 90% of the cost.

Model	Input cost	Output cost	Total/month	Annual
Gemini Flash-Lite	$0.17	$1.58	$1.74	$20.93
Gemini Flash / GPT-4o mini	$0.34	$3.15	$3.49	$41.85
Claude 3.5 Haiku	$1.80	$21.00	$22.80	$273.60
GPT-4o	$5.63	$52.50	$58.13	$697.50
Claude 3.5 Sonnet	$6.75	$78.75	$85.50	$1,026.00

Totals are computed before rounding to cents.

Content generation is where the output cost asymmetry hits hardest. Output tokens are 3,500 per article vs 1,500 input -- 70% of total tokens. Choosing Gemini Flash instead of Claude Sonnet for content generation at this volume saves about $82/month ($984/year), and the gap scales linearly: at 1,000x the volume (50,000 posts per day) it is about $82,012/month. Test your specific brief format on both before committing at scale; for most structured content with detailed prompts, the quality difference is small.

Real cost scenario 3: AI agent workflow

Setup: A 10-step research agent running 10,000 tasks/month. Each task: average 15 LLM calls total (mix of planning, execution, and validation). Average per-call: 2,000 tokens input + 800 tokens output. Monthly tokens: 300M input + 120M output.

Note: this is 15 calls per task, not 1. That is the agent multiplier.

Model	Input cost	Output cost	Total/month	Annual
Gemini Flash-Lite	$22.50	$36	$58.50	$702
Gemini Flash / GPT-4o mini	$45	$72	$117	$1,404
Claude 3.5 Haiku	$240	$480	$720	$8,640
GPT-4o	$750	$1,200	$1,950	$23,400
Claude 3.5 Sonnet	$900	$1,800	$2,700	$32,400

The 15-call multiplier per task is the number that most teams miss. A task that looks like it will cost about $0.0008 on Flash (one call with 2,000 input and 800 output tokens) actually costs about $0.012 once the full 15-call execution is accounted for. For agent workloads, model routing pays off most: use Flash-Lite or Flash for simple planning and validation steps, and a more capable model only for the 2-3 steps that require complex reasoning.
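
To see what routing is worth, compare an all-Sonnet task with one that sends only three calls to Sonnet. The sketch below is illustrative: the names are hypothetical, every call is assumed to use the scenario's average of 2,000 input and 800 output tokens, and the 12/3 split is an assumption, not a measured workload.

```python
# USD per million tokens (input, output), as listed in this guide.
PRICES = {"flash-lite": (0.075, 0.30), "sonnet": (3.00, 15.00)}


def call_cost(model: str, input_tokens: int = 2_000, output_tokens: int = 800) -> float:
    """Cost in USD of one LLM call at the scenario's average token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def task_cost(calls_by_model: dict[str, int]) -> float:
    """Cost in USD of one agent task given how many calls go to each model."""
    return sum(call_cost(model) * calls for model, calls in calls_by_model.items())


if __name__ == "__main__":
    all_sonnet = task_cost({"sonnet": 15})
    routed = task_cost({"flash-lite": 12, "sonnet": 3})
    tasks = 10_000
    print(f"all Sonnet: ${all_sonnet:.4f}/task, ${all_sonnet * tasks:,.2f}/month")
    print(f"routed:     ${routed:.4f}/task, ${routed * tasks:,.2f}/month")
```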

See detailed agent cost analysis in the AI Agent Cost Breakdown 2026.

Three scenario cost comparison charts showing input and output token costs by model for chatbot, content generation, and agent workloads

Why businesses miscalculate token costs

Direct Answer

Token cost estimates are almost always too low because they count only the visible prompt and response, ignoring system prompt size, context accumulation, RAG retrieved content, tool call processing, and retry overhead.

Output inflation

The most common single mistake: estimating output length by what you want, not what the model generates. If you ask a model for a “brief summary” and the model produces 800 tokens when you expected 200, your output cost is 4x what you estimated. Measure actual output length on 50 real requests before projecting costs. Most teams find their output estimates are 30-50% low.

Context accumulation

In a multi-turn chatbot, every turn includes all prior turns in the context. A 10-turn conversation does not cost 10x a single turn in input; it costs considerably more because turn 10 includes the full history of turns 1-9. A conversation with 400 tokens per turn balloons from 400 tokens of input in turn 1 to 4,000 tokens by turn 10, for 22,000 input tokens in total. Average input per turn is 2,200 tokens, 5.5x the per-turn estimate.
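
The accumulation pattern is simple to simulate. This sketch assumes each turn adds a fixed number of tokens and that the full history is resent on every call; the function name and the optional `max_turns_kept` window are hypothetical, the latter standing in for a basic history-trimming policy.

```python
from typing import Optional


def input_tokens_per_turn(tokens_per_turn: int, turns: int,
                          max_turns_kept: Optional[int] = None) -> list[int]:
    """Input tokens billed on each turn when prior turns are resent as context."""
    inputs = []
    for turn in range(1, turns + 1):
        kept = turn if max_turns_kept is None else min(turn, max_turns_kept)
        inputs.append(tokens_per_turn * kept)
    return inputs


if __name__ == "__main__":
    full = input_tokens_per_turn(400, 10)
    trimmed = input_tokens_per_turn(400, 10, max_turns_kept=5)
    print("full history:", full[-1], "tokens on the last turn,", sum(full), "in total")
    print("last 5 turns:", trimmed[-1], "tokens on the last turn,", sum(trimmed), "in total")
```

Real conversations vary in turn length, so run the same calculation on logged token counts before relying on it.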

System prompt overhead

A 2,000-token system prompt adds 2,000 tokens to every single API call. At 100,000 daily calls, that is 200M tokens per day in system prompt overhead alone. At Gemini Flash: $30/day. At Claude Sonnet: $600/day. This overhead is invisible if you only count the user message.

RAG content

Applications using RAG retrieve 3-10 relevant documents and include them in the context with every request. Each retrieved document might be 500-2,000 tokens. At 5 documents x 1,000 tokens each, that is 5,000 tokens of context per request. This is often 3-5x the user message length and is not in the initial estimate.

Retry overhead

Models that fail or produce incorrect outputs get retried. A 10% retry rate adds 10% to your actual token usage vs your planned usage. The overhead does not compound, but it recurs on every invoice, so it adds up over a year.

Tool calls

Agents and function-calling applications add tokens for tool call formatting, tool outputs, and the model's processing of tool results. Each tool call round trip typically adds 500-2,000 tokens to the exchange.

Cost per useful output

Direct Answer

The cheapest model per token is not always the cheapest model per completed task. A model that costs 20% more per token costs less per useful result once it avoids enough retries, rejected outputs, and human rework.

This framing matters most for three types of applications.

Classification and extraction

If you are classifying support tickets, extracting structured data, or labeling content at scale, your cost is determined by the number of correct classifications, not the number of API calls. A model at $0.30/MTok input that classifies correctly 95% of the time costs $0.00015 per API call and about $0.00016 per correct classification (assuming 500 input tokens). A model at $0.075/MTok that classifies correctly 70% of the time costs $0.0000375 per API call but about $0.000054 per correct classification once you retry failures. The 4x per-token gap narrows to roughly 3x per correct result, and that is before counting the cost of detecting the 30% of wrong answers, which usually requires a validation step or human review.
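
Cost per correct result follows from a one-line formula: cost per call divided by the success rate. The sketch below assumes failures are detected for free and retried independently until one succeeds, which is optimistic; the function name is hypothetical.

```python
def cost_per_correct_result(tokens_per_call: int, price_per_mtok: float,
                            success_rate: float) -> float:
    """Expected USD per correct result when failed calls are retried until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_call = tokens_per_call * price_per_mtok / 1_000_000
    # With independent retries, the expected number of attempts is 1 / success_rate.
    return cost_per_call / success_rate


if __name__ == "__main__":
    cost = cost_per_correct_result(500, 0.30, 0.95)
    print(f"${cost:.6f} per correct classification")
```

Adding the cost of whatever detects the failures (a second model call, a rule check, or human review) to `cost_per_call` gives a more realistic comparison.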

Code generation

Claude 3.5 Sonnet at $3.00/$15.00 generates code that developers accept without modification perhaps 80% of the time on typical tasks. GPT-4o mini at $0.15/$0.60 generates code that requires significant editing perhaps 50% of the time. If a developer costs $100/hour and fixing a failed generation takes 10 minutes, the time cost per failed generation is $16.67. At 100 daily generations, the developer rework cost on mini is $833/day (50 failures). On Sonnet, it is $333/day (20 failures). The API cost difference is small compared to the labor cost difference.
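
The labor side of that comparison is a short calculation. The sketch uses the illustrative rates from the paragraph above (they are estimates, not measurements), and the function name is hypothetical.

```python
def daily_rework_cost(generations_per_day: int, failure_rate: float,
                      hourly_rate: float, minutes_per_fix: float) -> float:
    """USD per day of developer time spent fixing generations that were not accepted."""
    failures = generations_per_day * failure_rate
    return failures * hourly_rate * minutes_per_fix / 60


if __name__ == "__main__":
    mini = daily_rework_cost(100, 0.50, 100, 10)
    sonnet = daily_rework_cost(100, 0.20, 100, 10)
    print(f"GPT-4o mini:       ${mini:,.0f}/day")
    print(f"Claude 3.5 Sonnet: ${sonnet:,.0f}/day")
```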

Customer support resolution

A chatbot that resolves support queries on the first response 70% of the time vs 85% of the time has a measurable difference in human escalation cost. Each escalation to a human agent costs $2-$5 in human time. The model quality differential translates directly to escalation rate difference, which translates to monthly human cost.

How to reduce token costs

Optimize prompt length

✓Audit system prompts for redundancy; typical first-draft prompts are 30-40% longer than necessary
✓Replace multi-sentence instructions with examples (usually fewer tokens, better results)
✓Remove repeated instructions that restate the same point differently
✓Expected savings: 15-30% on input tokens

Control output length

✓Add explicit output length constraints to prompts
✓Use structured outputs (JSON, specific format templates) to reduce preamble
✓Output tokens cost 4-5x as much as input at the rates in this guide; shorter outputs reduce costs directly
✓Expected savings: 20-40% on output costs

Implement prompt caching

✓Cache system prompts, knowledge base documents, and other repeated context
✓Anthropic: 90% off cached reads; Google: 75% off; OpenAI: 50% automatic
✓Expected savings: 30-70% on input costs for applications with large repeated context

Reduce context accumulation

✓Summarize conversation history every 5-8 turns instead of passing full history
✓Trim RAG retrieved content to relevant sections rather than full documents
✓Set explicit context budget limits per conversation
✓Expected savings: 20-40% on context-heavy applications

Route by model tier

✓Use Flash-Lite or GPT-4o mini for classification, routing, and simple formatting
✓Use Flash or GPT-4o mini for standard conversational and content tasks
✓Reserve Haiku, Sonnet, or GPT-4o for tasks where quality directly affects outcomes
✓Expected savings: 40-70% when 60-80% of queries go to cheap model tiers

Use batch APIs for offline workloads

✓OpenAI Batch API: 50% off GPT-4o pricing
✓Anthropic batch API: significant discounts for asynchronous workloads
✓Applies to: bulk analysis, overnight processing, content generation queues
✓Expected savings: 40-50% on applicable workloads

For a broader model-by-model cost comparison, see our LLM Cost Comparison 2026 guide.

Token cost reduction checklist showing the five key optimizations and their expected savings percentages

Token cost calculator framework

The formula for estimating monthly token costs:

Monthly LLM Cost =
  (Daily Input Tokens x 30 x Input Price / 1,000,000)
  + (Daily Output Tokens x 30 x Output Price / 1,000,000)

Daily Input Tokens =
  (calls/day) x (system prompt tokens
              + avg user message tokens
              + avg context tokens)

Daily Output Tokens =
  (calls/day) x (avg response tokens)
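
The same formula as a small calculator. This is a sketch with hypothetical names, and the sample workload in the usage block is made up for illustration; replace it with your own measured values.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    calls_per_day: int
    system_prompt_tokens: int
    avg_user_tokens: int
    avg_context_tokens: int
    avg_response_tokens: int


def monthly_cost(w: Workload, input_price: float, output_price: float,
                 days: int = 30) -> float:
    """Monthly USD cost; prices are per million tokens, no caching or retries."""
    daily_input = w.calls_per_day * (
        w.system_prompt_tokens + w.avg_user_tokens + w.avg_context_tokens
    )
    daily_output = w.calls_per_day * w.avg_response_tokens
    input_cost = daily_input * days * input_price / 1_000_000
    output_cost = daily_output * days * output_price / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical workload: 1,000 calls/day at Gemini 2.5 Flash rates.
    sample = Workload(1_000, 500, 300, 200, 250)
    print(f"${monthly_cost(sample, 0.15, 0.60):.2f}/month")
```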

Worked example: SaaS customer support chatbot

3,000 conversations per day
System prompt: 800 tokens (below Gemini's 32K caching minimum, so billed in full on every call)
Average user message + history: 1,200 tokens
Average response: 400 tokens
Model: Gemini 2.5 Flash ($0.15 input / $0.60 output)

Daily input tokens: 3,000 x 2,000 = 6.0M

Daily output tokens: 3,000 x 400 = 1.2M

Monthly input cost: 6.0M x 30 x $0.15/MTok = $27.00

Monthly output cost: 1.2M x 30 x $0.60/MTok = $21.60

Total monthly: $48.60

At 30,000 conversations/day (10x scale): ~$486/month.

The numbers to measure before estimating:

☐Actual token count in your system prompt (use Vortenza AI Token Counter)
☐Average token count in user messages across 50-100 real examples
☐Average token count in model responses across 50-100 real examples
☐Average conversation context length at your typical turn depth

Use the Vortenza AI Prompt Cost Estimator to paste real prompt examples and get accurate token counts and cost projections across multiple models simultaneously.

Which model offers the best token value?

Use case	Recommended model	Input/MTok	Output/MTok	Why
Startup / MVP	Gemini Flash (free tier)	$0.15 (free to start)	$0.60 (free to start)	Free tier covers early usage; upgrade path clear
High-volume chatbot	Gemini Flash-Lite + Flash routing	$0.075-$0.15	$0.30-$0.60	Cheapest reliable option; route complex queries to Flash
Quality production chatbot	Claude 3.5 Haiku	$0.80	$4.00	Better instruction following than Flash; 90% caching discount
Content generation at scale	Gemini Flash	$0.15	$0.60	Best output-per-dollar; output costs about 6.7x less than Haiku
Complex coding work	Claude 3.5 Sonnet	$3.00	$15.00	Acceptance rate justifies premium over cheaper models
Long document processing	Gemini Flash (1M context)	$0.15	$0.60	1M context eliminates chunking at Flash pricing
Enterprise AI agents	Claude Haiku + Flash routing	$0.15-$0.80	$0.60-$4.00	Route by step complexity; Haiku for reasoning, Flash for simple steps
Batch / async processing	OpenAI Batch API (GPT-4o)	$1.25	$5.00	50% off standard; mature API ecosystem
Classification / extraction	Gemini Flash-Lite	$0.075	$0.30	Cheapest capable model; handles structured tasks well

The honest summary: Gemini Flash wins on pure per-token value for most workloads. Claude Haiku wins when instruction following quality justifies 5x the price. Sonnet wins when the output goes directly to users or downstream systems without human review. GPT-4o mini wins for OpenAI ecosystem teams who want Flash-level pricing with OpenAI tooling familiarity.

One-minute token cost audit

Use before launching any AI application or when an invoice is higher than expected.

Understanding your token structure

☐ Do you know your current split between input tokens and output tokens per request?
☐ Is your output share above 60% of total token cost? (If so, optimize output length first.)
☐ Have you measured actual output lengths on 50+ real requests, or are you estimating?
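To check the output share in the second item, a small calculation like the one below is enough. This is a sketch: the function name and the 2,000/500 token split are hypothetical, and the prices are the Claude 3.5 Sonnet figures used in this guide.

```python
def output_cost_share(input_tokens, output_tokens, input_price, output_price):
    """Return the fraction of a request's cost that comes from output tokens.

    Prices are dollars per million tokens; the unit cancels out in the ratio.
    """
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    return output_cost / (input_cost + output_cost)

# Hypothetical request: 2,000 input tokens and 500 output tokens
# at $3.00 input / $15.00 output per million tokens.
share = output_cost_share(2_000, 500, input_price=3.00, output_price=15.00)
print(f"Output is {share:.0%} of the cost")  # prints: Output is 56% of the cost
```

In this example output is only 20% of the tokens but more than half of the cost, which is why measuring the split matters.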

Identifying the big inputs

☐ How many tokens is your system prompt?
☐ How many tokens of conversation history do you pass per request?
☐ If you use RAG, how many tokens of retrieved content per request?
☐ Have you measured all of these with Vortenza AI Token Counter?

Caching status

☐ Is prompt caching implemented for your system prompt?
☐ Is RAG document context being cached where applicable?
☐ Are you taking advantage of OpenAI's automatic caching?

Model routing

☐ Are you using the same model for all request types regardless of complexity?
☐ What percentage of requests could be handled by a 5-10x cheaper model?
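The routing question can be answered with a blended-price calculation. The sketch below assumes requests on both routes use similar token counts, and the 70% routing share is a made-up figure; the prices are the Claude 3.5 Haiku and Gemini Flash input prices from this guide.

```python
def blended_price(cheap_price, premium_price, cheap_fraction):
    """Average price per million tokens when a share of traffic goes to the cheaper model."""
    return cheap_fraction * cheap_price + (1 - cheap_fraction) * premium_price

# Assumption: 70% of requests are simple enough for the cheaper model.
haiku_input, flash_input = 0.80, 0.15
blended = blended_price(flash_input, haiku_input, cheap_fraction=0.70)
savings = 1 - blended / haiku_input
print(f"${blended:.3f}/MTok blended input, {savings:.0%} below all-Haiku")
```

The same function works for output prices; run it once per price type and weight by your input/output split.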

Quick answers


Q: What is a token in AI?

A: A token is a chunk of text that an AI language model processes. In English, one token is roughly 0.75 words or 4 characters. "Hello" is one token. "Anthropomorphization" is about 4 tokens. AI companies charge per token because tokens directly correspond to the compute required to process or generate that text.

Q: How many tokens is 1,000 words?

A: 1,000 words is approximately 1,333 tokens in English. The general conversion is 1 word = 1.33 tokens, or 750 words = 1,000 tokens. This varies slightly by model and language. Non-English text and code can tokenize differently. Measure your specific content with a token counter for accurate cost estimates.

Q: How much does one million tokens cost?

A: It ranges from $0.075 per million input tokens (Gemini 2.5 Flash-Lite) to $15.00 per million output tokens (Claude 3.5 Sonnet). In between: Gemini 2.5 Flash at $0.15 input / $0.60 output, GPT-4o mini at $0.15 input / $0.60 output, Claude 3.5 Haiku at $0.80 input / $4.00 output, and GPT-4o at $2.50 input / $10.00 output.

Q: Why do output tokens cost more than input tokens?

A: Generating each output token requires the model to run a full forward pass through its neural network -- computationally intensive work done serially, one token at a time. Reading input tokens can be done in parallel and is less expensive per token. At GPT-4o, output tokens cost 4x more than input tokens ($10 vs $2.50/MTok). At Claude 3.5 Sonnet, output costs 5x more ($15 vs $3.00/MTok).

Q: Is GPT-4o more expensive than Claude?

A: It depends on the tier. GPT-4o ($2.50/$10.00 per million tokens) is cheaper than Claude 3.5 Sonnet ($3.00/$15.00): about 17% less on input and 33% less on output. In the budget tier the order reverses: Claude 3.5 Haiku ($0.80/$4.00) is more expensive than GPT-4o mini ($0.15/$0.60). For raw token pricing, GPT-4o mini and Gemini 2.5 Flash are the cheapest production-capable options from their respective companies.

Q: Is Gemini cheaper than GPT-4o?

A: Yes, significantly. Gemini 2.5 Flash costs $0.15/$0.60 per million tokens, approximately 17x cheaper than GPT-4o on both input and output. Gemini 2.5 Flash-Lite at $0.075/$0.30 is 33x cheaper on input than GPT-4o. Gemini 2.5 Pro at $1.25/$5.00 is approximately 2x cheaper than GPT-4o.

Q: What is prompt caching and how does it reduce token costs?

A: Prompt caching stores a snapshot of your repeated prompt content (system prompt, document context) in the model's memory. Subsequent requests using the same cached prefix are charged at a much lower rate. Anthropic charges 10% of standard input price for cached reads. Google charges approximately 25%. OpenAI automatically caches qualifying inputs at 50% off. For applications with large repeated context, caching is typically the single largest cost reduction available.
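To see what those discounts mean for an average input price, here is a minimal sketch. It uses the cached-read rates quoted above and ignores cache-write surcharges and storage fees; the 80% cacheable share and the function name are assumptions for illustration.

```python
# Cached-read price as a fraction of standard input price, as quoted above.
CACHE_READ_MULTIPLIER = {"Anthropic": 0.10, "Google": 0.25, "OpenAI": 0.50}

def effective_input_price(input_price, cached_fraction, read_multiplier):
    """Average input price per million tokens when part of the input is a cache hit.

    Ignores cache-write surcharges and storage fees, which some providers add.
    """
    uncached = (1 - cached_fraction) * input_price
    cached = cached_fraction * input_price * read_multiplier
    return uncached + cached

# Assumption: 80% of input tokens are a repeated, cacheable prefix.
# Prices: Claude 3.5 Haiku, Gemini 2.5 Flash, GPT-4o input prices from this guide.
for provider, price in [("Anthropic", 0.80), ("Google", 0.15), ("OpenAI", 2.50)]:
    effective = effective_input_price(price, 0.80, CACHE_READ_MULTIPLIER[provider])
    print(f"{provider}: ${price:.2f} -> ${effective:.3f} per MTok")
```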

Q: How many tokens are in a typical chatbot conversation?

A: A typical customer support chatbot exchange (question + response) uses 300-800 tokens. A complete 8-turn conversation uses roughly 5,000-7,000 tokens total because context accumulates -- each turn includes all prior turns in the context. For cost estimation, use 5,600 tokens as a reasonable average for an 8-turn customer support conversation.
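The accumulation effect is easy to underestimate, so here is a small sketch of it. It assumes every request resends the system prompt plus the full history; the token sizes (200, 50, 100) are hypothetical, not measurements.

```python
def conversation_input_tokens(system_tokens, user_tokens, reply_tokens, turns):
    """Total input tokens billed across a conversation when every request
    resends the system prompt plus all prior turns."""
    total = 0
    history = 0  # tokens from earlier user messages and model replies
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total

# Hypothetical sizes: 200-token system prompt, 50-token questions, 100-token replies.
input_total = conversation_input_tokens(200, 50, 100, turns=8)
output_total = 8 * 100
print(input_total, output_total, input_total + output_total)  # prints: 6200 800 7000
```

Without accumulation the same eight turns would need only 2,000 input tokens (8 x 250), so resending history roughly triples the input bill in this example.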

Q: How do I estimate token costs for my application?

A: Measure your actual usage: count tokens in your system prompt, measure average user message length, measure average model response length (not your target length -- the actual length). Multiply calls per day by those averages, then apply the per-token price. Add 20% for retries and overhead. Use Vortenza's AI Prompt Cost Estimator to paste real prompts and get accurate counts across multiple models.

Q: What is the cheapest GPT model?

A: GPT-4o mini is OpenAI's cheapest production model at $0.15 per million input tokens and $0.60 per million output tokens. It supports a 128K context window and handles most chatbot and content generation tasks competently. For asynchronous workloads, GPT-4o via the Batch API costs $1.25 input / $5.00 output -- half the standard GPT-4o price, but still roughly 8x the price of mini, and with up to 24-hour result latency.

Q: What is the cheapest Claude model?

A: Claude 3.5 Haiku is the cheapest current Claude model at $0.80 per million input tokens and $4.00 per million output tokens. With Anthropic's prompt caching, cached reads on Haiku cost $0.08 per million tokens -- 90% off standard input pricing. Claude's cheapest option is significantly more expensive than Gemini Flash or GPT-4o mini.

Q: What is the cheapest Gemini model?

A: Gemini 2.5 Flash-Lite is the cheapest at $0.075 per million input tokens and $0.30 per million output tokens. It also has the most generous free tier. Gemini 2.5 Flash at $0.15/$0.60 is the better choice for applications where Flash-Lite's quality limitations are a problem.

Q: How does context window size affect token costs?

A: A larger context window does not by itself cost more. You pay for the tokens you actually use in each request. But a larger context window allows you to pass more text in a single request without chunking, which can reduce the total number of API calls needed for long-document tasks. Gemini's 1M context window at Flash pricing lets you process entire long documents in one call for $0.15 per million input tokens -- cheaper and simpler than multiple GPT-4o calls with aggregation.

Q: What is the token cost difference between GPT-4o vs GPT-4o mini for the same task?

A: GPT-4o costs $2.50 input / $10.00 output per million tokens. GPT-4o mini costs $0.15 input / $0.60 output. Mini is about 16.7x cheaper per token on both input and output. For a chatbot with 100,000 monthly conversations averaging 5,600 tokens each (the figures here correspond to roughly 3,200 input and 2,400 output tokens per conversation), GPT-4o costs approximately $3,200/month and mini costs approximately $192/month. The cost difference is $3,008/month, or $36,096 annually.
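The monthly figures can be reproduced with a few lines of arithmetic. The split of each 5,600-token conversation into 3,200 input and 2,400 output tokens is an assumption chosen because it matches the totals above; the function name is illustrative.

```python
def conversation_monthly_cost(conversations, input_tokens, output_tokens,
                              input_price, output_price):
    """Monthly cost in dollars; prices are dollars per million tokens."""
    input_mtok = conversations * input_tokens / 1_000_000
    output_mtok = conversations * output_tokens / 1_000_000
    return input_mtok * input_price + output_mtok * output_price

# Assumed split of the 5,600 tokens per conversation: 3,200 input + 2,400 output.
gpt4o = conversation_monthly_cost(100_000, 3_200, 2_400,
                                  input_price=2.50, output_price=10.00)
mini = conversation_monthly_cost(100_000, 3_200, 2_400,
                                 input_price=0.15, output_price=0.60)
print(f"GPT-4o ${gpt4o:,.0f}, mini ${mini:,.0f}, difference ${gpt4o - mini:,.0f}")
```

A different input/output split changes both totals, but the ratio between the two models stays near 16.7x because both prices differ by the same factor.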

Q: Do AI companies charge for the system prompt on every request?

A: Yes. Every token in your system prompt is charged as input on every API call. A 1,500-token system prompt sent with 100,000 daily requests costs 150M input tokens per day. At Gemini Flash pricing, that is $22.50/day or $675/month for the system prompt alone. Prompt caching cuts this cost substantially by storing the system prompt and charging for cached reads at roughly 10-50% of standard input price, depending on the provider.

Frequently asked questions

What is the difference between input tokens and output tokens in pricing?

Input tokens are the tokens your prompt, system message, context, and retrieved documents use. Output tokens are the tokens the model generates in its response. Input and output tokens are billed separately at different rates. At every major provider, output tokens cost more than input tokens -- typically 3-10x more. The practical implication: applications that generate long responses are output-dominated in cost, while applications that pass large context with short responses are input-dominated. Most cost estimates underweight output because people focus on the prompt size and underestimate response length.

Why do token prices vary so much between models?

Token prices reflect the model's inference cost (how much compute each token requires to generate), the provider's competitive positioning, and their overall economics. Larger, more capable models cost more to run because they have more parameters and require more compute per forward pass. Frontier models (GPT-5, Claude Opus) are genuinely more expensive to serve. Budget models (Flash-Lite, GPT-4o mini) use architecture and training optimizations that reduce inference cost significantly. Pricing also reflects market dynamics: Google's Gemini Flash pricing, for example, appears aggressive enough that gaining market share is likely part of the goal.

How does tokenization work differently across GPT, Claude, and Gemini?

Each model uses its own tokenizer vocabulary. GPT models use byte-pair encodings available through the tiktoken library: cl100k_base (roughly 100K tokens) for GPT-4 and GPT-3.5 Turbo, and o200k_base (roughly 200K tokens) for GPT-4o and GPT-4o mini. Anthropic uses a different tokenization scheme. Google uses SentencePiece. The practical differences are small for English text -- all three produce counts within 10-15% of each other for the same content. Differences are more pronounced for code (especially uncommon languages), non-English languages, and text with many special characters or technical notation. For accurate cost estimation in these cases, measure with the specific model's tokenizer rather than using a generic estimate.

What is the real-world average token usage per day for a mid-size application?

Volume depends heavily on the feature. A mid-size SaaS with 1,000 active daily users can generate 50M-500M tokens per day when an AI-assisted feature is used heavily with large contexts, but many workloads are far smaller. A customer support chatbot at 5,000 conversations/day with 5,600 tokens per conversation uses 28M tokens per day. A content generation tool producing 100 articles/day at 5,000 tokens per article uses only 500,000 tokens per day. At Gemini Flash pricing ($0.15/MTok input, $0.60/MTok output), 28M tokens cost $4.20-$16.80/day depending on the input/output mix, and 500M tokens cost $75-$300/day.

How do I reduce token costs without changing the model?

The four highest-impact changes that do not require changing your model: implement prompt caching for repeated context (saves 30-70% on input), trim conversation history to the last 5-8 turns with a summary of earlier turns (saves 20-40%), reduce system prompt size by removing redundant instructions (saves 10-20% on input), and add explicit output length constraints to prompts (saves 15-30% on output). Together these can reduce total token costs by 40-60% without touching the model.

Does the token count differ for images or multimodal inputs?

Yes. Images are converted to a token count based on image size and resolution. At OpenAI, GPT-4o in high-detail mode charges 85 base tokens plus 170 tokens per 512x512 tile, so a 1024x1024 image costs 765 tokens and an image that fits in a single 512x512 tile costs 255 tokens. At Anthropic, images are converted at approximately 750-1,500 tokens per image. At Google, a standard image under 384x384 counts as 258 tokens. These image token counts are added to the input token total and billed at the same input token price. For image-heavy applications, model the image token cost explicitly.
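For the OpenAI case, the count can be sketched from the tile-based rule OpenAI has documented for GPT-4o: 85 base tokens plus 170 per 512-pixel tile after resizing. Treat this as an approximation under that assumption. Other models and newer releases may count differently, and the function name is illustrative.

```python
import math

def gpt4o_image_tokens(width, height, detail="high"):
    """Approximate GPT-4o image token count from pixel dimensions."""
    if detail == "low":
        return 85  # low-detail mode is a flat charge regardless of size

    # Step 1: scale down to fit within a 2048 x 2048 square.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = int(width * scale), int(height * scale)

    # Step 2: scale down so the shortest side is at most 768 pixels.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = int(width * scale), int(height * scale)

    # Step 3: count 512-pixel tiles and add the base charge.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1024, 1024))  # prints: 765
print(gpt4o_image_tokens(512, 512))    # prints: 255
```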

How does batch processing reduce token costs?

Batch processing sends API requests asynchronously and accepts results within 24 hours rather than in real time. OpenAI's Batch API costs 50% of standard pricing -- GPT-4o via batch costs $1.25/$5.00 per million tokens instead of $2.50/$10.00. Anthropic's Message Batches API offers a comparable 50% discount on asynchronous workloads. The token count does not change; you pay for the same number of tokens but at a lower per-token price. Applicable workloads: bulk content analysis, nightly data processing, large document summarization, embedding generation, classification runs.

What is the minimum context size for Anthropic's prompt caching?

Anthropic requires a minimum prompt length to establish a cache checkpoint: 1,024 tokens for Claude 3.5 Sonnet and 2,048 tokens for Claude 3.5 Haiku. Content below the threshold cannot be cached. For most production applications with meaningful system prompts and knowledge base context, this threshold is easy to exceed. The cache is established on the first request (charged at a write cost of 25% more than standard input) and reads on subsequent requests cost 10% of standard input price. Cache entries expire after approximately 5 minutes without access, and each read refreshes that lifetime.
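The write surcharge means caching only pays off once the cached prefix is read at least once. A minimal sketch of the break-even, assuming all requests arrive within the cache lifetime so there is a single write (the function name and the 2,000-token prefix are illustrative):

```python
def cached_vs_uncached(prefix_tokens, requests, input_price,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Cost in dollars of sending the same prefix `requests` times.

    Assumes one cache write on the first request and cache reads afterwards.
    input_price is dollars per million tokens.
    """
    cost_per_send = prefix_tokens / 1_000_000 * input_price
    uncached = cost_per_send * requests
    cached = cost_per_send * (write_multiplier + read_multiplier * (requests - 1))
    return uncached, cached

# Hypothetical 2,000-token prefix at Claude 3.5 Haiku's $0.80/MTok input price.
for n in (1, 2, 100):
    uncached, cached = cached_vs_uncached(2_000, n, input_price=0.80)
    print(f"{n:>3} requests: uncached ${uncached:.5f}, cached ${cached:.5f}")
```

With one request the cached path costs 25% more; from the second request onward it is cheaper, and the gap widens with every additional read.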

How do retries affect my real token usage vs estimated usage?

Every retry is charged as a new API call with its full token count. If your application retries failed requests automatically and 10% of requests fail and retry once, your effective token usage is 10% higher than your successful-request count suggests. If a complex agent run fails at step 8 of 10 and restarts from step 1, you pay for 18 steps instead of 10 -- an 80% cost increase on that run. Most cost estimation models ignore retry overhead. Add 10-20% to your token estimates to account for retries unless you have measured your actual retry rate.
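Both retry effects are simple to model. The sketch below assumes a failed request is resent once in full, and that a failed agent run restarts from step 1 with equally sized steps; the function names are illustrative.

```python
def tokens_with_retries(tokens_per_request, successful_requests, retry_rate):
    """Tokens billed when a fraction of requests fail once and are resent in full."""
    return tokens_per_request * successful_requests * (1 + retry_rate)

def agent_steps_billed(total_steps, failed_step):
    """Steps billed when a run fails at failed_step and restarts from step 1."""
    return failed_step + total_steps

billed = agent_steps_billed(total_steps=10, failed_step=8)
overhead = billed / 10 - 1
print(billed, f"{overhead:.0%}")  # prints: 18 80%
```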

What is the cost difference between using RAG vs fine-tuning for reducing token costs?

RAG (Retrieval-Augmented Generation) adds tokens to each request (the retrieved documents), but avoids the cost of fine-tuning and keeps the model up to date with changing information. Fine-tuning reduces tokens per request (less need for detailed instructions in the system prompt) but costs $8-$25 per million training tokens one-time plus ongoing inference costs. For most applications, well-designed RAG with caching of the retrieved context is more cost-effective than fine-tuning, especially for dynamic knowledge bases that change regularly.

How do I accurately estimate output token count before deploying?

Run your prompt on 50-100 representative real-world inputs and measure the actual output token counts. Do not estimate based on your desired output length -- models often produce more tokens than the minimum needed to answer a question. Add explicit output length constraints to your prompts and measure the effect. The distribution of output lengths often has a long tail: most responses are short, but 5-10% are very long and drive average cost up significantly. Include the 90th percentile output length in your cost model, not just the median.
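To illustrate why the median alone is misleading, the sketch below summarizes a made-up sample of output lengths with a long tail. The sample values are invented for illustration, and the percentile uses the simple nearest-rank method.

```python
import statistics

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method (pct is an integer 1-100)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]

# Invented sample: 17 short responses and 3 very long ones (output tokens).
output_lengths = [120, 130, 110, 140, 150, 125, 135, 115, 145, 160,
                  130, 120, 140, 155, 150, 135, 125, 900, 1200, 1500]

median = statistics.median(output_lengths)
mean = statistics.mean(output_lengths)
p90 = nearest_rank_percentile(output_lengths, 90)
print(median, mean, p90)  # prints: 137.5 294.25 900
```

Here three long responses out of twenty more than double the mean relative to the median, so a cost model built on the median would underestimate output spend by about half.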

What is the per-token cost for embeddings vs inference?

Embedding models (used for semantic search and RAG) are priced separately from inference models and are much cheaper. OpenAI's text-embedding-3-small costs $0.02 per million tokens. Google's text-embedding-004 costs approximately $0.025 per million tokens. Anthropic does not offer a dedicated embedding model. Embedding costs are typically small relative to inference costs for most applications, but for knowledge bases requiring frequent re-embedding (content that changes regularly), embedding costs add up.

How does context window size affect my cost when I do not use the full context?

You only pay for the tokens you actually include in a request, not for the available context window size. A model with a 1M token context window does not cost more than one with 128K unless you actually send 1M tokens in a request. The advantage of a larger context window is that you can include more content when needed without switching to a chunking approach. Gemini's 1M context at Flash pricing means you have the option to process very long documents cheaply when your application needs it.

Can I reduce costs by asking the model to be more concise?

Yes. Adding explicit brevity instructions to your system prompt reduces output tokens directly. "Respond in under 100 words" is a reliable constraint that most models follow. "Be concise" is less reliable -- models interpret this differently. The most effective approach is explicit word or token limits in the prompt combined with structured output formats (JSON, specific templates) that constrain response shape. In testing, explicit length constraints typically reduce average output tokens by 25-40% compared to unconstrained prompts for typical Q&A and support tasks.

Why is Claude 3.5 Sonnet so much more expensive than other models?

Claude 3.5 Sonnet is priced at the high end of its capability tier because Anthropic's positioning emphasizes quality over cost. Sonnet is genuinely better than comparable models on complex instruction following, coding, and nuanced writing tasks. Whether that quality advantage is worth paying $15.00/MTok for output when Gemini Flash charges $0.60/MTok depends entirely on whether your application's outcomes are measurably better with Sonnet quality. For many applications, they are not. For some -- production code generation, high-stakes customer communication, complex analysis -- the quality premium pays for itself in reduced editing and error correction costs.

Final Verdict

Cheapest model per token: Gemini 2.5 Flash-Lite at $0.075/$0.30 per million tokens. For applications where its quality level is acceptable (classification, extraction, structured Q&A), it is the default starting point.

Best value model: Gemini 2.5 Flash at $0.15/$0.60. The combination of Flash pricing, 1M context window, and good production quality handles the vast majority of real-world applications.

Best startup model: Gemini 2.5 Flash on the free tier, then paid Flash when you exceed daily limits. Zero cost to start, clear upgrade path, largest context window in the budget tier.

Best enterprise model: Depends on the task. Claude 3.5 Haiku with aggressive prompt caching for production chatbots. Claude 3.5 Sonnet for quality-critical work. Gemini Flash for high-volume, cost-sensitive applications. GPT-4o for multimodal and function-calling-heavy workflows.

The calculation that matters is cost per useful output, not cost per token. Before choosing a model based on price alone, measure your specific task accuracy rate across model candidates. Many teams estimate token usage before deployment using Vortenza's AI Token Counter to measure real prompt token counts and AI Prompt Cost Estimator to compare cost projections across GPT, Claude, and Gemini on their actual prompts. Most teams find their initial estimates are off by 2-3x once they measure real output lengths and account for context accumulation.
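One way to make "cost per useful output" concrete is to divide the expected cost of a request, including the cost of handling failures, by the success rate. This is a sketch of one possible definition, not a standard formula, and every number in it is hypothetical.

```python
def cost_per_useful_output(cost_per_request, success_rate, failure_handling_cost=0.0):
    """Expected dollars spent per successful output.

    failure_handling_cost is what a bad output costs you downstream
    (human review, rework, a retry on a stronger model).
    """
    expected_cost = cost_per_request + (1 - success_rate) * failure_handling_cost
    return expected_cost / success_rate

# Hypothetical models: cheap at $0.001/request with 80% task accuracy,
# premium at $0.005/request with 97% accuracy; a failure costs $0.05 to fix.
cheap = cost_per_useful_output(0.001, 0.80, failure_handling_cost=0.05)
premium = cost_per_useful_output(0.005, 0.97, failure_handling_cost=0.05)
print(f"cheap ${cheap:.5f}, premium ${premium:.5f} per useful output")
```

With these made-up numbers the premium model is cheaper per useful output even though it costs 5x more per request; with a failure cost of zero the cheap model wins. Which side you land on depends on your measured accuracy and what failures actually cost you.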

About this guide

Published by the Vortenza Editorial Team. Token pricing data sourced from OpenAI pricing page, Anthropic pricing page, and Google AI Studio pricing page as of June 2026. Tokenization examples use GPT-4o tokenizer as a reference; actual counts vary slightly by model. Verify current pricing at each provider before making financial decisions.

Tools used in this guide

AI Token Counter

Count tokens in your actual prompts by model before estimating costs. Free.

AI Prompt Cost Estimator

Paste your prompt and compare costs across GPT, Claude, and Gemini at current pricing. Free.

LLM Cost Comparison

Side-by-side cost comparison across all major models at your expected volume. Free.

OpenAI Cost Calculator

OpenAI-specific cost estimation by model and token volume. Free.