How Large Language Models Actually Work — Tokens, Attention, and the Math Behind ChatGPT
What an LLM Actually Is
A Large Language Model is, mechanically, a function. You give it a sequence of words. It returns a probability distribution over what word should come next. That is the entire job. Everything dramatic — answering questions, writing code, holding a conversation, summarizing a paper — is built on top of that one repeated prediction.
The illusion of conversation comes from looping: predict the next word, append it to the input, predict the next word again, repeat. A response that feels like a thought is actually a chain of thousands of next-word predictions, each one slightly informed by all the words that came before.
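In code, that loop is only a few lines. A minimal sketch, assuming a hypothetical `predict_next_token` function that stands in for one forward pass of the model and a hypothetical `END_OF_TEXT` marker:

```python
def generate(prompt_tokens, max_new_tokens=200):
    # tokens starts as the user's prompt and grows one prediction at a time.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # hypothetical: one forward pass
        if next_token == END_OF_TEXT:            # hypothetical "I'm done" token
            break
        tokens.append(next_token)                # feed the prediction back in
    return tokens
```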
Once you internalize that, every weird LLM behavior becomes legible. The "creativity" is sampling. The "memory" is the prompt. The "hallucinations" are confident predictions made on incomplete data. There is no thinking happening between predictions; there is only a very large pattern-matcher running very fast.
This article is the mental model. Tokens, context, attention, training, inference — what they are, what they cost, and why the model gives you different answers each time.
Tokens — The Unit of Language
LLMs do not see words. They see tokens — small chunks of text that may be a word, part of a word, or a punctuation mark. The English word "language" might be one token. The word "tokenization" might be split into "token" + "ization." A Chinese character is usually one token by itself.
Tokenization happens before the model sees anything. A separate piece of software (the tokenizer) splits your input text into a list of integer IDs from a fixed vocabulary, usually around 100,000 to 200,000 tokens for modern models. The model only ever sees these integer IDs.
This matters in practice for two reasons:
- You pay per token, not per word. OpenAI, Anthropic, and Google all charge by the token. A 1000-word article is roughly 1,300 tokens in English, and more in languages that are underrepresented in the tokenizer's vocabulary (Hindi, Japanese, Arabic), where a single word often splits into several tokens.
- Context windows are measured in tokens. When you read "1M context window," that is one million tokens — about 750,000 words of English, or roughly seven or eight average-length novels.
You can play with the tokenizer to see exactly how text is split: OpenAI ships a free tool, and most LLM APIs return token counts in their responses. Looking at a few examples will demystify the unit faster than any explanation.
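If you have Python handy, OpenAI's open-source tiktoken library (assuming `pip install tiktoken`) shows exactly what the model receives:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4-era vocabulary

ids = enc.encode("tokenization")
print(ids)                                   # a short list of integer IDs
print([enc.decode([i]) for i in ids])        # the text chunk behind each ID
```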
Context — The Model's Working Memory
A model has no persistent memory between calls. Everything it knows about your conversation is whatever you sent in this single request. If you ask "what was my name again?" and your name is not in the current request, the model does not have it.
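This is why every chat application re-sends the whole conversation on every turn. A sketch, assuming a hypothetical `call_model` function that takes the message history and returns one reply:

```python
history = []

def chat(user_message):
    # The model remembers nothing between calls, so each turn ships
    # the entire conversation so far inside a single request.
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)   # hypothetical API call
    history.append({"role": "assistant", "content": reply})
    return reply

chat("My name is Priya.")
chat("What was my name again?")   # only works because turn 1 is still in history
```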
The context window is the size of that single request. Modern models have context windows of:
| Model class | Context window (tokens) |
|---|---|
| GPT-3.5 (2022) | 4,096 |
| GPT-4 (2023) | 8,192 → 32,768 |
| Claude 2 (2023) | 100,000 |
| Gemini 1.5 (2024) | 1,000,000 |
| Gemini 3 (2025) | 1,000,000+ |
Bigger context lets you stuff more into the prompt — entire books, full code repositories, hours of meeting transcripts. The cost is real: every token in the context is processed for every token of output, so a 1M-token prompt is expensive in both money and latency.
The way real applications get around this is retrieval — store all your data somewhere searchable, and at request time pull only the most relevant bits into the prompt. This pattern is called RAG (retrieval-augmented generation) and powers most production AI applications you have used.
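A bare-bones sketch of the pattern, assuming a hypothetical `embed` function that turns text into a vector and a hypothetical `call_model` helper that takes a prompt string:

```python
import numpy as np

documents = ["chunk of your data ...", "another chunk ...", "yet another ..."]
doc_vectors = np.array([embed(d) for d in documents])   # computed once, stored

def answer(question, top_k=2):
    q = np.array(embed(question))
    # Cosine similarity between the question and every stored chunk.
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]
    prompt = "Answer using only these sources:\n" + "\n".join(best) + "\n\nQuestion: " + question
    return call_model(prompt)   # only the relevant chunks enter the context
```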
Attention — How the Model Decides What Matters
When the model predicts the next token, it does not weight every previous token equally. It uses a mechanism called attention to focus on the parts of the input most relevant to the current prediction.
A simplified picture: for each new token the model is about to generate, it computes a similarity score between the current state and every previous token, then weights its prediction by those scores. Tokens that look related to the current step get heavy weight; unrelated tokens get near-zero weight.
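The full mechanism is compact enough to write out. A single attention head in numpy, ignoring batching, multiple heads, and the causal mask:

```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (sequence_length, dimension) matrices derived from the tokens.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # how related is each position to every other?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                       # weighted mix of the value vectors
```

Related tokens end up with large weights, unrelated ones near zero, which is exactly the weighting described above.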
This is why a long prompt with one specific question buried in the middle still works — the model attends to the question even though there is a wall of context around it. It is also why prompt structure matters. Putting your actual question at the end of a long prompt typically works better than burying it in the middle, because models recall information near the beginning and end of the context more reliably than information buried in the middle.
The technical name for the architecture built around attention is the Transformer, introduced in a 2017 Google paper titled "Attention is All You Need." Every modern LLM you have heard of — GPT, Claude, Gemini, LLaMA, Mistral — is a Transformer. The differences are training data, parameter count, fine-tuning details, and engineering polish, not the underlying architecture.
Training — Where the Knowledge Comes From
A pretrained LLM has been shown trillions of tokens of text scraped from the public internet, books, papers, code, and curated datasets. During training, the model is repeatedly given a chunk of text and asked to predict, at every position, the token that comes next; its weights are adjusted whenever the prediction is wrong. Run this over enough text with enough parameters and the model learns the statistical structure of language deeply enough to fool everyone.
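The objective itself is simple to write down. A sketch of how one chunk of text becomes a training example, with hypothetical token IDs:

```python
import numpy as np

token_ids = [464, 3290, 11299, 319, 262, 2603]   # hypothetical IDs for a short sentence

inputs  = token_ids[:-1]   # what the model sees
targets = token_ids[1:]    # at each position, the token it should predict next

def loss(predicted_probs, targets):
    # predicted_probs: (positions, vocab_size) distributions the model produced.
    # Cross-entropy: penalize low probability on the true next token.
    return -np.mean(np.log([predicted_probs[i, t] for i, t in enumerate(targets)]))
```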
After this base training, modern models go through alignment — additional training where humans rate outputs and the model is nudged toward helpful, harmless, honest behavior. This step is what makes ChatGPT feel like a polite assistant rather than a raw text predictor.
A few honest implications:
- The training data has a cutoff date. If a model was trained on data up to mid-2024, it has no idea what happened in 2025. It will sometimes confidently answer anyway, because the training did not teach it to say "I do not know."
- The model knows the patterns in its training data, not the truth. If false information was repeated often in the training set, the model has learned the false information.
- The up-front cost is dominated by training, not inference. A single training run for a frontier model can cost $50M+; the per-query cost you pay is a tiny fraction of that.
Inference — Why Each Run Is Slightly Different
When you send a prompt, the model produces a probability distribution over the next token. Every token in the vocabulary (all 100,000 or more of them) gets some non-zero probability. The system has to pick one.
The default approach is sampling — pick a token randomly, weighted by the probabilities. The most likely token is most likely to be picked, but the second-most-likely also gets a real chance. Two parameters control this:
- Temperature — how much randomness to inject. `temperature=0` always picks the most likely token (deterministic), `temperature=1.0` is the default randomness, and `temperature=2.0` is wild.
- Top-p / top-k — how many of the top candidates are eligible to be picked at all. Caps the long tail of unlikely tokens.
This is why asking the same question twice gives different answers. With `temperature=0`, you get the same answer every time, but it is also the most boring answer: exactly the most likely token at every step, with no variation at all. Most production systems run somewhere between 0.3 and 0.7.
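A minimal sketch of temperature plus top-p sampling over one step, using numpy:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9):
    if temperature == 0:
        return int(np.argmax(logits))          # greedy: always the most likely token

    # Temperature rescales the scores: lower sharpens the distribution, higher flattens it.
    probs = np.exp((logits - logits.max()) / temperature)
    probs = probs / probs.sum()

    # Top-p (nucleus): keep only the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize and sample from that set.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=kept_probs))
```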
Why Hallucinations Happen
A model has no concept of "I do not know." It has only "the probability distribution over the next token, given everything in context." If the most likely continuation looks like a confident statement of fact, the model produces a confident statement of fact, regardless of whether the fact is true.
Hallucinations cluster in predictable places:
- Specific numbers and dates the model is not confident about — it picks a plausible-looking number.
- Citations and URLs — the model has learned what citations look like, so it generates citation-shaped strings even when no real source exists.
- Anything past the training cutoff — the model interpolates from old data.
- Domain-specific facts outside the training distribution — coding patterns from obscure libraries, medical guidelines that changed recently, niche legal details.
The mitigations real applications use are: retrieval (give the model the source documents), tool use (let it call a calculator or search engine), and verification (a second model reads the first model's output and flags claims that need to be checked). None of these are complete fixes. They reduce, not eliminate.
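As one example, the verification step can be as blunt as a second call that audits the first. A sketch, again assuming a hypothetical `call_model` helper:

```python
def answer_with_check(question, sources):
    draft = call_model(f"Answer using only these sources:\n{sources}\n\nQuestion: {question}")
    audit = call_model(
        "List any claims in the ANSWER that are not supported by the SOURCES.\n"
        f"SOURCES:\n{sources}\n\nANSWER:\n{draft}"
    )
    # A real system would parse the audit and regenerate, strip the
    # unsupported claims, or flag the answer for a human to review.
    return draft, audit
```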
What "Thinking Mode" and "Reasoning" Mean
Some recent models advertise a "thinking mode" or chain-of-thought capability. Mechanically, this means the model is allowed to generate hidden tokens — a scratchpad of intermediate reasoning — before producing the visible answer. The architecture is unchanged; the model is the same Transformer doing the same next-token prediction. What changed is that it is permitted to predict more tokens before the response is shown to the user.
This works surprisingly well for math, coding, and structured reasoning, because the visible output is the result of a longer chain of conditional predictions instead of a single shot. It does not work magic for tasks where the bottleneck is missing information rather than missing thought.
What This Costs
A practical comparison, assuming standard pricing as of 2025:
| Model | Input cost / 1M tokens | Output cost / 1M tokens |
|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Haiku | $1.00 | $5.00 |
| Gemini 2.5 Flash | $0.30 | $2.50 |
| Gemini 3 Flash Preview | $0.50 | $3.00 |
| GPT-4o | $5.00 | $20.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
For a typical chatbot interaction (~500 input tokens, ~500 output tokens), this is fractions of a cent for the small models and a few cents for the large ones. The model choice matters a lot when you are doing millions of calls; less so for hobby projects.
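The arithmetic is worth doing once. A small helper using the prices from the table above:

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Prices are dollars per 1M tokens, as listed in the table above."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A typical chatbot turn (~500 tokens in, ~500 out):
print(request_cost(500, 500, 0.15, 0.60))   # GPT-4o-mini: ~$0.0004
print(request_cost(500, 500, 5.00, 20.00))  # GPT-4o:      ~$0.0125
```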
Where This Fits
Lesson 11 of the ABCsteps curriculum has you call an LLM API for the first time. Without the model in your head — tokens, context, sampling — the API parameters look arbitrary. With it, every parameter has a clear purpose. The lesson has you set `temperature: 0.7` and `max_tokens: 500`, and you will know exactly what you are asking for.
Apply this hands-on in Module C, "AI Products Are API Systems": Lesson 11 is your first LLM API call, and this article is the foundation that makes it make sense.