Hey folks,

Last week we looked at the three ways to customize LLM behavior: prompting, RAG, and fine-tuning.

This week I want to cover something that trips up almost every developer when they first start building with AI. Tokens and context windows. Once you get this, a lot of the weird behavior you've seen from LLMs starts to make sense.

You've Probably Hit This Already

You're mid-conversation with ChatGPT. You pasted in a long document. The responses start getting vague. It stops referencing things you mentioned earlier. Or you get a hard error: "This model's maximum context length is 128,000 tokens."

What happened?

To understand it, you need to know how LLMs actually read text.

What Is a Token?

Most people assume LLMs read words. They don't.

LLMs read tokens. A token is a fragment of text. It can be a full word, part of a word, a punctuation mark, or even a space. Tokenization is how raw text gets broken down into something the model can process numerically.

Some examples using GPT's tokenizer:

  • "Hello" is 1 token
  • "tokenization" is roughly 3 tokens
  • "ChatGPT is great!" is 5 tokens
  • "unbelievable" is 2-3 tokens depending on the model

The rule of thumb: 1 token is about 0.75 words, or 4 characters. So 1,000 words is roughly 1,300 tokens.

Why not just use words? Because tokenization lets models handle anything: made-up words, technical jargon, code, foreign languages. It breaks unknown text into known subword pieces. Much more flexible than a fixed word vocabulary.

One more thing: different models use different tokenizers. The same text can have different token counts in GPT-4 vs Claude vs Gemini.

What Is a Context Window?

Every LLM has a context window. It's the maximum number of tokens the model can process in a single call.

Think of it as working memory. You can only hold so much in your head at once. Same with an LLM. Everything inside the context window is what the model uses to generate its response. Everything outside it, the model simply cannot see.

The context window holds everything:

  • Your system prompt
  • The full conversation history
  • Any documents you've injected
  • The model's previous responses
  • The response it's currently writing

All of it counts toward the limit together.
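To make the bookkeeping concrete, here's a minimal sketch using the ~4-characters-per-token rule of thumb from earlier (a real system would use the model's actual tokenizer, and the message shapes here are just illustrative):

```python
# Everything in a request shares one budget: system prompt, history,
# and whatever room is left for the model's response.
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token.
    return max(1, len(text) // 4)

def context_usage(system_prompt: str, messages: list[dict], max_context: int = 128_000):
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(m["content"]) for m in messages)
    return used, max_context - used

messages = [
    {"role": "user", "content": "Summarize this 10-page report..."},
    {"role": "assistant", "content": "Here are the key points..."},
]
used, remaining = context_usage("You are a helpful assistant.", messages)
print(f"{used} tokens used, {remaining} left for the response")
```

The point isn't the exact numbers (the estimate is crude) but the structure: every component draws from the same pool, and the response has to fit in whatever is left.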

Context windows across popular models today:

Model               Context Window
-----               --------------
GPT-4o              128,000 tokens (~96,000 words)
Claude 3.5 Sonnet   200,000 tokens (~150,000 words)
Gemini 1.5 Pro      1,000,000 tokens (~750,000 words)
GPT-3.5 Turbo       16,000 tokens (~12,000 words)
Llama 3 (8B)        8,000 tokens (~6,000 words)

These numbers have grown a lot. The original GPT-3 had a 2,048-token limit. GPT-4 launched at 8K, expanded to 32K, and now sits at 128K. Gemini's 1M window can fit entire codebases.

Why Models Start "Forgetting"

This is the part that catches most developers off guard.

LLMs have no persistent memory between requests. Each API call is stateless. The model only sees what's in the current context window, nothing else. When you have a long chat in ChatGPT, the app is silently appending the full conversation history to every single request behind the scenes.

A conversation that starts at 100 tokens grows to 500, then 2,000, then 10,000. Eventually you hit the limit.

When that happens, one of three things occurs depending on how the app is built:

  1. Hard error -- the API rejects the call
  2. Truncation -- the oldest messages get quietly dropped
  3. Summarization -- older context gets compressed automatically

Well-built chat apps handle this gracefully with summarization. But if you're building your own system, this is your problem to solve.
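If you go the truncation route, the simplest version is a rolling window: pin the system prompt and drop the oldest turns until the history fits. A minimal sketch (the helper names and the 4-chars-per-token estimate are mine, not from any SDK):

```python
def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop the oldest messages until the conversation fits within `budget` tokens."""
    trimmed = list(messages)
    while trimmed and sum(estimate_tokens(m["content"]) for m in trimmed) > budget:
        trimmed.pop(0)  # quietly drop the oldest turn
    return trimmed
```

This is exactly strategy 2 above, and it has the failure mode described below: whatever you drop, the model genuinely never sees again.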

This is also why responses feel inconsistent in long conversations. If early messages got truncated, the model genuinely cannot see them anymore. It's not hallucinating. The information just isn't there.

Why Big Context Windows Are Expensive

From Issue #45 on Attention, you know that transformers compute relationships between every token and every other token in the context. That's self-attention.

The problem: the compute required grows quadratically with context length. Double the context, quadruple the compute.

That's why:

  • Bigger context windows cost more per API call
  • Longer prompts take longer to process
  • A 1M token model costs significantly more to run than a 128K one
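The quadratic growth is easy to see directly. Self-attention compares every token with every other token, so doubling the context quadruples the number of pairwise comparisons:

```python
# Self-attention relates every token to every other token,
# so the work grows with the square of the context length.
for n in [1_000, 2_000, 4_000]:
    print(f"{n:>6} tokens -> {n * n:>14,} pairwise comparisons")
```

Going from 1,000 to 4,000 tokens is a 4x increase in context but a 16x increase in attention work.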

This also explains why "just make context windows infinite" is not a simple engineering problem. The cost curve is brutal.

Practical takeaway: don't stuff the context window for no reason. A focused 5,000-token prompt will often outperform a bloated 50,000-token one, and it'll cost 10x less.

Tokens Cost Money

Every API call is billed by the token. Getting good at this will save you real money.

Typical pricing as of mid-2025:

Model               Input (per 1M tokens)   Output (per 1M tokens)
-----               ---------------------   ----------------------
GPT-4o              $5.00                   $15.00
Claude 3.5 Sonnet   $3.00                   $15.00
GPT-4o mini         $0.15                   $0.60
Gemini 1.5 Flash    $0.075                  $0.30

Input tokens (your prompt and context) are cheaper than output tokens (the model's response). Worth keeping in mind when you design your system.

A concrete example: you're building a customer support bot and you dump your entire 50,000-word knowledge base into every prompt. That's about 65,000 tokens per call. At $5 per 1M tokens, each query costs $0.33 on input alone. At 10,000 queries a day, that's $3,300/day just on input.

RAG from Issue #48 solves this. Instead of sending 65,000 tokens, you retrieve the 3 most relevant chunks, roughly 2,000 tokens. Same query now costs $0.01. That's a 33x reduction.
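The arithmetic is worth doing yourself for your own workload. A sketch reproducing the numbers above (the function name is mine; plug in your model's actual rates):

```python
def input_cost(tokens: int, price_per_million: float = 5.00) -> float:
    """Input-side cost of one call at a given per-1M-token rate."""
    return tokens * price_per_million / 1_000_000

full_kb = input_cost(65_000)  # dump the whole knowledge base every call
rag     = input_cost(2_000)   # retrieve the 3 most relevant chunks instead
print(f"${full_kb:.3f} vs ${rag:.3f} per query, {full_kb / rag:.1f}x reduction")
```

Multiply by your daily query volume and the architectural decision makes itself.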

Practical Rules for Working with Tokens

Check token counts before going to production. OpenAI has tiktoken. Anthropic exposes token counts in API responses. Don't estimate, measure.
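In practice that can look like this: use the real tokenizer when it's available, and fall back to the rule of thumb only for rough planning. A sketch (assumes tiktoken is installed via `pip install tiktoken`; the fallback path is my own estimate, not a library feature):

```python
def count_tokens(text: str) -> int:
    """Measure with tiktoken when available; estimate (~4 chars/token) otherwise."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
        return len(enc.encode(text))
    except ImportError:
        return max(1, len(text) // 4)  # rough planning estimate only

print(count_tokens("Hello"))
```

Run the measured path before production; the estimate is fine for napkin math but not for billing forecasts.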

Put the most important content at the start or end of your prompt. LLMs pay more attention to the beginning and end of context. Burying critical instructions in the middle of a long prompt is a good way to have them ignored. This is called the lost-in-the-middle problem and it's well documented.

Trim your system prompts. System prompts run on every single request. A 2,000-token system prompt you could rewrite as 400 tokens adds up fast at any real scale.

Use cheaper models for token-heavy preprocessing. Need to process a long document? Run it through GPT-4o mini or Gemini Flash first to summarize, then pass the summary to a more capable model. Tiered inference keeps costs in check.

Plan for context limits upfront. Don't build a system assuming unlimited context. Decide early how you'll handle conversations that get too long. Summarize? Truncate? Store history in a database and retrieve it with RAG?

Context Windows Are Not the Same as Memory

This one confuses a lot of people.

                  Context Window                       Long-Term Memory
Scope             Single conversation                  Across sessions
Mechanism         In-context tokens                    External storage (DB, vector store)
Cost              Charged per token                    Storage + retrieval
Persistence       Gone when the session ends           Persists indefinitely
Example           Claude reading a pasted document     Claude remembering your name next week

When you see products advertise "AI with memory," they're using external storage plus retrieval, not a bigger context window. The model still has the same fixed limit. They're just being selective about what goes into it.
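A toy sketch of that pattern (the store and function names are hypothetical, and a real system would use a database or vector store with semantic retrieval, not a dict):

```python
# "Memory" lives outside the model. Each request, you select the relevant
# facts and inject them into the context window, whose size never changes.
memory_store: dict[str, list[str]] = {}  # in practice: a DB or vector store

def remember(user_id: str, fact: str) -> None:
    memory_store.setdefault(user_id, []).append(fact)

def build_prompt(user_id: str, question: str) -> str:
    facts = memory_store.get(user_id, [])
    # Only the selected facts enter the context window.
    return "Known about this user:\n" + "\n".join(facts) + f"\n\nUser: {question}"

remember("u1", "Name is Pranay")
print(build_prompt("u1", "What's my name?"))
```

The model never "remembers" anything; the application remembers, and curates what gets a seat inside the fixed window.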

Wrapping Up

Tokens are how LLMs read text. Subword fragments, not words. Context windows are how much they can read at once. These two things shape almost every practical decision you'll make when building with AI.

The things worth remembering:

  • 1 token is roughly 0.75 words. Count tokens, not words.
  • Context windows hold everything: system prompt, history, injected docs, the current response.
  • LLMs are stateless. No memory outside the context window.
  • Attention scales quadratically. Bigger context costs more and runs slower.
  • Tokens are money. Treat your prompts like you treat your database queries.
  • Lost-in-the-middle is real. Keep critical content near the top or bottom of your context.

Once this clicks, you'll write tighter prompts, design smarter architectures, and waste a lot less on API costs.

What's Next

Next week: Temperature & Sampling -- the settings that control how creative or predictable an AI's output is, and when you'd actually want to change them.

Read the full AI Learning series -> Learn AI

Thanks for reading! Got questions, feedback, or want to chat about AI? Hit reply – I read and respond to every message. And if you found this valuable, feel free to forward it to a friend who'd benefit!

Pranay
Infolia.ai
