Hey folks,
Last week we looked at the three ways to customize LLM behavior: prompting, RAG, and fine-tuning.
This week I want to cover something that trips up almost every developer when they first start building with AI. Tokens and context windows. Once you get this, a lot of the weird behavior you've seen from LLMs starts to make sense.
You've Probably Hit This Already
You're mid-conversation with ChatGPT. You pasted in a long document. The responses start getting vague. It stops referencing things you mentioned earlier. Or you get a hard error: "This model's maximum context length is 128,000 tokens."
What happened?
To understand it, you need to know how LLMs actually read text.
What Is a Token?
Most people assume LLMs read words. They don't.
LLMs read tokens. A token is a fragment of text. It can be a full word, part of a word, a punctuation mark, or even a space. Tokenization is how raw text gets broken down into something the model can process numerically.
Some examples using GPT's tokenizer:
- "Hello" is 1 token
- "tokenization" is roughly 3 tokens
- "ChatGPT is great!" is 5 tokens
- "unbelievable" is 2-3 tokens depending on the model
The rule of thumb: 1 token is about 0.75 words, or 4 characters. So 1,000 words is roughly 1,300 tokens.
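That rule of thumb is easy to turn into a quick back-of-the-envelope estimator. A minimal sketch (the constants are the heuristics above, not exact values; real counts vary by tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token estimate using the ~0.75 words per token heuristic."""
    return round(word_count / 0.75)

print(estimate_tokens("tokenization"))       # 12 characters → 3 tokens
print(estimate_tokens_from_words(1_000))     # 1,000 words → 1333 tokens
```

Good enough for budgeting; for billing-grade numbers you'd use the model's actual tokenizer.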
Why not just use words? Because tokenization lets models handle anything: made-up words, technical jargon, code, foreign languages. It breaks unknown text into known subword pieces. Much more flexible than a fixed word vocabulary.
One more thing: different models use different tokenizers. The same text can have different token counts in GPT-4 vs Claude vs Gemini.
What Is a Context Window?
Every LLM has a context window. It's the maximum number of tokens the model can process in a single call.
Think of it as working memory. You can only hold so much in your head at once. Same with an LLM. Everything inside the context window is what the model uses to generate its response. Everything outside it, the model simply cannot see.
The context window holds everything:
- Your system prompt
- The full conversation history
- Any documents you've injected
- The model's previous responses
- The response it's currently writing
All of it counts toward the limit together.
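To make that concrete, here's a sketch of one call's context budget, with made-up token counts for each piece:

```python
# Hypothetical token counts for a single API call; every piece
# shares the same fixed limit.
CONTEXT_LIMIT = 128_000  # e.g. GPT-4o

usage = {
    "system_prompt": 400,
    "conversation_history": 9_500,
    "injected_documents": 62_000,
    "response_budget": 4_000,  # tokens reserved for the model's reply
}

used = sum(usage.values())
print(f"{used:,} / {CONTEXT_LIMIT:,} tokens ({CONTEXT_LIMIT - used:,} remaining)")
```

Notice the documents dominate: one big paste can eat half the window before the conversation even starts.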
Context windows across popular models today:
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens (~96,000 words) |
| Claude 3.5 Sonnet | 200,000 tokens (~150,000 words) |
| Gemini 1.5 Pro | 1,000,000 tokens (~750,000 words) |
| GPT-3.5 Turbo | 16,000 tokens (~12,000 words) |
| Llama 3 (8B) | 8,000 tokens (~6,000 words) |
These numbers have grown fast. The original GPT-3 had a 2,048-token limit. GPT-4 launched at 8K, expanded to 32K, and now sits at 128K. Gemini's 1M window can fit entire codebases.
Why Models Start "Forgetting"
This is the part that catches most developers off guard.
LLMs have no persistent memory between requests. Each API call is stateless. The model only sees what's in the current context window, nothing else. When you have a long chat in ChatGPT, the app is silently appending the full conversation history to every single request behind the scenes.
A conversation that starts at 100 tokens grows to 500, then 2,000, then 10,000. Eventually you hit the limit.
When that happens, one of three things occurs depending on how the app is built:
- Hard error -- the API rejects the call
- Truncation -- the oldest messages get quietly dropped
- Summarization -- older context gets compressed automatically
Well-built chat apps handle this gracefully with summarization. But if you're building your own system, this is your problem to solve.
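If it is your problem to solve, truncation is the simplest starting point. A bare-bones sketch, assuming each message already carries a token count (the message shape here is illustrative, not any particular SDK's):

```python
def truncate_history(messages, limit):
    """Drop the oldest non-system messages until the total fits the limit.

    messages: list of dicts like {"role": "user", "tokens": 400}
    """
    trimmed = list(messages)
    total = sum(m["tokens"] for m in trimmed)
    while total > limit:
        for i, m in enumerate(trimmed):
            if m["role"] != "system":   # always keep the system prompt
                total -= m["tokens"]
                del trimmed[i]
                break
        else:
            break  # nothing left to drop
    return trimmed

history = [
    {"role": "system", "tokens": 50},
    {"role": "user", "tokens": 400},
    {"role": "assistant", "tokens": 300},
    {"role": "user", "tokens": 200},
]
print([m["role"] for m in truncate_history(history, limit=600)])
```

The first user message silently disappears, which is exactly the "forgetting" behavior described above.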
This is also why responses feel inconsistent in long conversations. If early messages got truncated, the model genuinely cannot see them anymore. It's not hallucinating. The information just isn't there.
Why Big Context Windows Are Expensive
From Issue #45 on Attention, you know that transformers compute relationships between every token and every other token in the context. That's self-attention.
The problem: the compute required grows quadratically with context length. Double the context, quadruple the compute.
That's why:
- Bigger context windows cost more per API call
- Longer prompts take longer to process
- A 1M token model costs significantly more to run than a 128K one
This also explains why "just make context windows infinite" is not a simple engineering problem. The cost curve is brutal.
Practical takeaway: don't stuff the context window for no reason. A focused 5,000-token prompt will often outperform a bloated 50,000-token one, and it'll cost 10x less.
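The quadratic growth is easy to see in numbers. Self-attention compares every token against every other token, so the pairwise work scales with the square of context length:

```python
def attention_pairs(context_tokens: int) -> int:
    """Number of token-to-token comparisons in self-attention (~n^2)."""
    return context_tokens ** 2

small = attention_pairs(5_000)
doubled = attention_pairs(10_000)
print(doubled / small)  # doubling the context quadruples the pairwise work
```

Real implementations have optimizations, but the basic scaling is why long contexts stay expensive.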
Tokens Cost Money
Every API call is billed by the token. Getting good at this will save you real money.
Typical pricing as of mid-2025:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $5.00 | $15.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
Input tokens (your prompt and context) are cheaper than output tokens (the model's response). Worth keeping in mind when you design your system.
A concrete example: you're building a customer support bot and you dump your entire 50,000-word knowledge base into every prompt. That's about 65,000 tokens per call. At $5 per 1M input tokens, each query costs about $0.33 on input alone. At 10,000 queries a day, that's roughly $3,250/day just on input.
RAG from Issue #48 solves this. Instead of sending 65,000 tokens, you retrieve the 3 most relevant chunks, roughly 2,000 tokens. Same query now costs $0.01. That's a 33x reduction.
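The arithmetic behind that comparison, as a sketch (using the GPT-4o input rate from the pricing table above):

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input cost in dollars for one call."""
    return tokens / 1_000_000 * price_per_million

full_dump = input_cost(65_000, 5.00)  # entire knowledge base in every prompt
rag_query = input_cost(2_000, 5.00)   # 3 retrieved chunks instead

print(f"${full_dump:.3f} vs ${rag_query:.3f} per query")
print(f"${full_dump * 10_000:,.0f}/day at 10,000 queries")
```

Run your own numbers before shipping; the ratio matters more than the exact prices, which change often.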
Practical Rules for Working with Tokens
Check token counts before going to production. OpenAI has tiktoken. Anthropic exposes token counts in API responses. Don't estimate, measure.
Put the most important content at the start or end of your prompt. LLMs pay more attention to the beginning and end of context. Burying critical instructions in the middle of a long prompt is a good way to have them ignored. This is called the lost-in-the-middle problem and it's well documented.
Trim your system prompts. System prompts run on every single request. A 2,000-token system prompt you could rewrite as 400 tokens adds up fast at any real scale.
Use cheaper models for token-heavy preprocessing. Need to process a long document? Run it through GPT-4o mini or Gemini Flash first to summarize, then pass the summary to a more capable model. Tiered inference keeps costs in check.
Plan for context limits upfront. Don't build a system assuming unlimited context. Decide early how you'll handle conversations that get too long. Summarize? Truncate? Store history in a database and retrieve it with RAG?
Context Windows Are Not the Same as Memory
This one confuses a lot of people.
| | Context Window | Long-Term Memory |
|---|---|---|
| Scope | Single conversation | Across sessions |
| Mechanism | In-context tokens | External storage (DB, vector store) |
| Cost | Charged per token | Storage + retrieval |
| Persistence | Gone when the session ends | Persists indefinitely |
| Example | Claude reading a document you pasted | Claude remembering your name next week |
When you see products advertise "AI with memory," they're using external storage plus retrieval, not a bigger context window. The model still has the same fixed limit. They're just being selective about what goes into it.
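A minimal sketch of that pattern: facts live in external storage, and only the relevant ones get injected into the fixed-size context on each request (the store and keys here are entirely made up):

```python
# Hypothetical external "memory": a store the model never sees directly.
memory_store = {
    "user_name": "Sam",
    "preferred_language": "Python",
    "last_topic": "context windows",
}

def build_prompt(question: str, relevant_keys: list[str]) -> str:
    """Inject only the retrieved facts into the prompt, not the whole store."""
    facts = "\n".join(f"- {k}: {memory_store[k]}" for k in relevant_keys)
    return f"Known facts about the user:\n{facts}\n\nQuestion: {question}"

print(build_prompt("What language should the examples use?",
                   ["user_name", "preferred_language"]))
```

In production the store is a database or vector index and retrieval is a similarity search, but the principle is the same: memory is selection, not a bigger window.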
Wrapping Up
Tokens are how LLMs read text. Subword fragments, not words. Context windows are how much they can read at once. These two things shape almost every practical decision you'll make when building with AI.
The things worth remembering:
- 1 token is roughly 0.75 words. Count tokens, not words.
- Context windows hold everything: system prompt, history, injected docs, the current response.
- LLMs are stateless. No memory outside the context window.
- Attention scales quadratically. Bigger context costs more and runs slower.
- Tokens are money. Treat your prompts like you treat your database queries.
- Lost-in-the-middle is real. Keep critical content near the top or bottom of your context.
Once this clicks, you'll write tighter prompts, design smarter architectures, and waste a lot less on API costs.
What's Next
Next week: Temperature & Sampling -- the settings that control how creative or predictable an AI's output is, and when you'd actually want to change them.
Read the full AI Learning series -> Learn AI
How was today's email?