It's 2 AM and my terminal is screaming error: context_length_exceeded. Again.
I was just testing a chatbot. Nothing complicated. A basic multi-turn conversation to see whether it actually remembered the system instructions properly. Then suddenly the whole script halts.
Like most developers, I completely ignored tokens until something broke in production. You do not really care about the plumbing until the API throws an error directly into your face.
That debugging session sent me down a weird token rabbit hole.
That is when the annoying part finally clicked: tokens are not just some abstract text metric hidden in API docs. They control how much context the model can actually hold and how much the API actually costs.
I paused the terminal and opened the billing dashboard.
I had confidently estimated this tiny app would cost around $12 a month to run.
The actual bill? $47. Just from local testing.
Turns out I had absolutely no idea how any of this worked under the hood. And honestly, most developers do not either until something fails badly enough that they are forced to learn it.
The word-to-token confusion that costs people money
It is basically a rite of passage to assume a token is just a fancy API word for "word."
Understanding the underlying technical parameters of tokenization can help you control your API bills. Key factors that determine your total context window cost include:
- Byte-pair encoding (BPE): The algorithmic method used to split text into subword units.
- Tiktoken: The open-source tokenizer library by OpenAI used to estimate token counts before sending requests.
- Out-of-vocabulary tokens: Unseen words or special symbols that must be broken down into individual characters, increasing counts.
- Prompt caching: Storing static system prompts to avoid repeated processing fees.
- Tokenization overhead: The extra tokens consumed when processing non-English languages or structured data like JSON.
You read the docs. You see the classic line:
100 tokens is roughly 75 English words.
Seems straightforward enough. Then your app bill shows up.
That estimate is exactly how developers convince themselves an app will cost twelve bucks a month and then somehow end up staring at a $47 invoice a few weeks later.
The problem is that the 75-word estimate becomes basically useless the second you stop using plain conversational English.
A token is not a word. A tiny word like "hi" might become a single token. But something longer like "unbelievable" gets split into multiple smaller pieces internally. You are paying for those slices, not the visible word itself.
Things get worse once code enters the picture.
The tokenizer does not read code the way humans do. Punctuation, symbols, brackets, indentation, special characters. All of that stuff eats through your limits fast.
I figured this out during a late debugging session while trying to understand why a relatively simple app suddenly started burning through the API budget and causing latency spikes. The prompts themselves were not longer. The users were not typing more.
The culprit was structured JSON output.
I had modified the system prompt so the model would return structured JSON instead of plain text because parsing the output was easier on my side. What I completely missed was how expensive JSON gets surprisingly fast.
The model does not just tokenize the actual data values. Every curly bracket, quotation mark, colon, and nested indent gets counted too. You are literally paying for formatting. You are paying the API to generate syntax and whitespace.
The actual useful content inside the JSON might only be fifty words, but all the surrounding structure quietly pushes the token count into the hundreds. That is exactly how apps start bleeding money without developers realizing why.
The context window mistake almost everyone makes

A lot of developers misunderstand how chatbot memory actually works. They constantly confuse token limits and context windows. Those are related concepts, but they are not the same thing.
You see a huge context number advertised on a pricing page and assume you are fine. Then you build a chatbot assuming the API only processes the newest user message while magically remembering everything else automatically.
That is not how these systems work. AI APIs are stateless.
They do not remember what happened five seconds ago unless you manually send the previous conversation back to them every single time. The context window is simply the total amount of information the model can "see" in one request. That includes the system prompt, the entire conversation history, and the new message. Everything.
It starts with the system prompt. Most developers write giant instruction blocks explaining how the assistant should behave. That chunk alone can consume a massive percentage of the available context before the user even says a single word.
Then the user types "Hello." You send the system prompt plus the user message. The model replies. Then the next user message comes in. You are not just sending the new question anymore. You are sending the system prompt, previous user messages, previous assistant replies, and the new question. Over and over again.
This is where costs quietly explode.
Every previous message gets re-sent to the API every single time a user hits enter. You are not paying once for the conversation. You are paying for message one, then message one plus two, then message one plus two plus three.
By the time somebody is twenty messages deep into a chatbot conversation, you are basically re-uploading an entire novel to the server just to answer a yes-or-no question. And most developers do not even realize this until something crashes.
I learned this during a GPT-4 code review session while refactoring a project. I was pasting multiple files into the conversation, debugging logic step-by-step, making real progress. Then suddenly my terminal threw context_length_exceeded. Session dead. No warning.
The context window had silently filled up with old files, previous iterations, and conversation history. Long conversations are not just an expense problem. They are a hard functional limit if you are not actively pruning conversation history.
Output tokens are where the real money disappears

Developers spend hours optimizing prompts.
You remove unnecessary examples. Minify instructions. Compress wording. Trim every token possible from the input prompt because you think smaller prompts automatically mean lower costs.
The annoying part is that most APIs charge separately for input tokens and output tokens. And output tokens usually cost more. That means your carefully optimized fifty-token prompt can still blow up your daily budget if the model responds with thousands of generated tokens.
I learned this while trying to build a cheap coding agent.
The idea sounded financially reasonable at first. I calculated the prompt costs and assumed the whole thing would run for basically nothing. Completely wrong. I spent weeks staring at billing dashboards trying to figure out where the leak was.
The leak was generated code. The model was producing massive outputs full of raw syntax, explanations, corrections, and rewrites. The output tokens destroyed my margins almost immediately.
This is why coding agents, long-form AI writing tools, and open-ended chatbots become expensive much faster than people expect. Generation costs dominate the economics. Output is where the real money disappears. And developers usually realize that only after they already paid the invoice.
Images make this worse too. People casually drag screenshots and diagrams into multimodal models without thinking about cost at all. But images consume tokens as well. You upload a PNG and suddenly your API balance drops way faster than expected.
The dangerous part is that developers think they are controlling costs by optimizing prompts. But the model controls the expensive part: the generation itself. If you are not capping output lengths, forcing concise responses, or limiting generation size, you are basically handing the API provider a blank check every time somebody sends a request.
How prompt caching reduces context window costs
To minimize the latency and cost of processing long inputs, modern APIs use prompt caching. Providers like Anthropic and OpenAI allow you to cache your system instructions, documents, and historical messages, charging a discounted rate for cached input tokens. This optimization helps you avoid paying the full input rate on repetitive context, making multi-turn chat applications and document-heavy workflows significantly cheaper.
Why GPT, Claude, and Gemini all count tokens differently

You eventually try comparing providers to save money. That is when everything gets messy.
You assume the same prompt should cost roughly the same everywhere. It does not. The exact same input can produce completely different token counts, context usage, and pricing depending on whether you run it through GPT, Claude, or Gemini.
There is basically no standardization across providers.
OpenAI uses tiktoken. Claude and Gemini use their own proprietary tokenizers internally. That means token estimates stop making sense the second you switch platforms. A lot of developers make the mistake of using OpenAI token estimates to calculate Claude costs. It is not reliable. At best, it gives you a rough directional estimate. The actual billing can still be wildly different. It feels like measuring something in inches and then getting billed in centimeters later.
Then you look at the context windows themselves:
- GPT-4o: approximately 128K tokens
- Claude 3.5 Sonnet: approximately 200K tokens
- Gemini 1.5 Pro: up to 1M tokens
And honestly, the giant context windows sometimes make developers worse. Because once people see huge limits, they start throwing huge amounts of context into prompts simply because the model technically allows it. Entire codebases. PDFs. Logs. Documentation dumps. The hard boundary disappears psychologically. The model processes it fine. Then the invoice arrives.
Bigger context windows do not automatically mean better cost-efficiency. A lot of developers overpay simply because they stop being careful once the limits get larger.
The counts do not align. The tooling does not align. The same prompt behaves differently everywhere. Every time a provider updates a tokenizer or releases a new model, developers end up recalibrating all their assumptions again.
How byte-pair encoding maps your text to numbers
AI models do not see text: they see arrays of integers representing token IDs. This mapping relies on byte-pair encoding (BPE), a subword tokenization algorithm that iteratively replaces the most frequent pairs of bytes in a text with a single new byte. For example, OpenAI models use the cl100k_base or o200k_base vocabularies via the tiktoken library. Common prefixes and suffixes get grouped into single tokens, whereas rare words, misspelled terms, or out-of-vocabulary tokens are split into individual characters or even byte-level pieces, which drastically increases token counts.
| Model | Context Window | Input Cost / 1M | Output Cost / 1M | Prompt Caching |
|---|---|---|---|---|
| GPT-4o | 128K tokens | $2.50 | $10.00 | Supported (50% discount) |
| Claude 3.5 Sonnet | 200K tokens | $3.00 | $15.00 | Supported (90% discount) |
| Gemini 1.5 Pro | 1M tokens | $1.25 | $5.00 | Supported (50% discount) |
For a detailed breakdown of what each model actually costs per token right now, the OpenAI API pricing guide covers the current numbers across providers. And if you want to compare token efficiency across GPT, Claude, and Gemini directly, the ChatGPT Plus vs Claude Pro vs Gemini comparison breaks down which model gives you more output per dollar.
How do you calculate the cost of AI tokens?
You calculate AI token costs by multiplying the number of input and output tokens by their respective rates per million tokens, then adding the two values together. For example, if a model charges $2.50 per million input tokens and $10.00 per million output tokens, a request with 800 input tokens and 200 output tokens will cost exactly $0.004.
This pricing structure exists because LLMs require more computational power to generate new text than to read existing text. Consequently, API providers charge a premium, typically three to five times higher, for output generation. To optimize these costs in production, developers rely on techniques like prompt caching and aggressive pruning of conversation history.
How to count tokens before you send

Stop trusting rough estimates. That "75 words equals 100 tokens" rule becomes unreliable the second you introduce code, JSON, or structured outputs. Three practical options exist for counting tokens before you send anything.
OpenAI's tiktoken library is accurate for GPT models, free, and runs locally. You can paste any prompt and get an exact token count before it touches the API. The limitation is that it is only reliable for OpenAI models.
The Vortenza AI Token Counter runs entirely in your browser. No API key needed. Nothing gets sent to any server. You paste your prompt, get a count, and can estimate costs across models before committing to a single API call. Useful if you are switching between providers and want a quick read without setting up local tooling.
Claude and Gemini both provide token counts in their playground interfaces. If you are testing prompts manually anyway, the count is right there without any extra setup.
Be honest about the limitation: cross-model accuracy is approximate because tokenizers differ. Counting tokens in tiktoken for a Claude prompt gives you a rough estimate, not an exact number. For production cost calculations, use the native tooling for whatever model you are actually deploying against.
Bottom line
Almost every developer learns how tokens actually work after something breaks in production. A billing spike. A failed deployment. A context window crash. That is usually when people finally start caring about token economics.
But token counting has to happen before deployment, not after the invoice arrives. Stop trusting rough estimates. Before shipping anything, run your prompts through a proper AI token counter and check actual costs in the API pricing guide.
If you are building chatbots, implement context pruning early. You cannot keep appending conversation history forever without costs quietly multiplying underneath you. And budget for worst-case output generation, not the carefully optimized prompt sitting in your editor. The expensive part is usually what the model generates back.
It is tedious work. Nobody learned programming because they dreamed about staring at token counters all day. But it is still better than waking up to broken requests, context_length_exceeded errors, and an API bill you did not expect.
