How LLMs Work

A tour of what's actually happening inside a Large Language Model when it writes you a poem, a Python function, or this sentence.

From words to tokens
Embeddings: words as arrows in space
Attention: looking back at what matters
Stacking layers into a deep network
Predicting the next token
The KV cache: not redoing your homework
Prefill vs. decode: two different jobs
How it learned all this

You've probably used a chatbot like ChatGPT or Claude. You type something, and a paragraph of reasonable English comes back. Underneath is a piece of software called a Large Language Model, or LLM: a giant pile of numbers (billions of them) arranged so that, when you feed in some text, it can guess what word is most likely to come next. Doing that over and over is how it writes whole sentences, essays, and computer programs.

This page walks through the main pieces of how that works. By the end, words like token, embedding, attention, KV cache, and prefill will mean something real to you.

1. From words to tokens

Computers don't read words; they read numbers. So the first thing an LLM does is chop your text into little pieces called tokens and look up a number for each one.

A token isn't always a whole word. Common words like the or cat get their own token. Rarer or longer words get split: unbelievable might become un + believ + able. This way the model can handle any word, even ones it's never seen, by gluing tokens together.

A sentence is split into tokens, and each token becomes a number the model can look up.

Every token has an ID: its row number in a giant dictionary (often 50,000 to 200,000 entries). After tokenisation, your sentence is just a list of numbers.
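
To make the splitting concrete, here is a minimal sketch. The vocabulary and the IDs in it are invented for illustration, and the matching rule (always take the longest piece that fits) is a simplification: real tokenisers learn their vocabulary from data and use more careful splitting rules.

```python
# Hypothetical toy vocabulary: token text -> token ID.
# Real vocabularies are learned from data and hold tens of thousands of entries.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "un": 3, "believ": 4, "able": 5, " ": 6}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split: at each position, take the longest known piece."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

print(tokenize("the cat"))       # [0, 6, 1]
print(tokenize("unbelievable"))  # [3, 4, 5]
```

The output is exactly the "list of numbers" described above: the model never sees the letters, only the IDs.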

2. Embeddings: words as arrows in space

Numbers like 3797 for cat are arbitrary. They don't tell the model that cats are furry, or that cat is similar to dog. So the model converts each token ID into something richer: an embedding.

An embedding is a list of numbers (say, 4,096 of them) that you can think of as coordinates of a point in a 4,096-dimensional space. That's impossible to picture, but in 2D it looks like this:

Words with related meanings end up near each other in embedding space. (Shown in 2D; real LLMs use thousands of dimensions.)

The neat thing is that direction in this space carries meaning too. The famous example comes from classic word-embedding research: if you take the embedding for king, subtract man, and add woman, you land close to queen. Roughly speaking, the model has learned that one direction tracks "royal" and another tracks "gender."

At the start of the LLM, every token gets replaced by its embedding. So your sentence (a list of token IDs) becomes a list of long number-arrows. That stack of arrows is what the model actually thinks about.
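
A small sketch of both ideas, nearness and direction. Every number below is made up by hand so the arithmetic is easy to follow; in a trained model these values are learned and there are thousands per token. Cosine similarity is one common way to measure how closely two arrows point the same way.

```python
import math

# Hypothetical 3-number embeddings, invented for illustration.
EMBEDDINGS = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "piano": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two arrows point the same way, near 0.0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_piano = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["piano"])
print(cat_dog > cat_piano)  # True: cat sits nearer to dog than to piano

# Hand-built 2D space: first number = "royal", second = "gender" (assumed axes).
TOY_2D = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}
result = [k - m + w for k, m, w in zip(TOY_2D["king"], TOY_2D["man"], TOY_2D["woman"])]
print(result == TOY_2D["queen"])  # True in this hand-built space
```

In a real model the arithmetic only lands near queen, not exactly on it, because learned directions are never this tidy.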

3. Attention: looking back at what matters

Now we get to the secret sauce. To predict the next word in "The cat sat on the ___", the model needs to look back over the earlier words and figure out which ones matter. cat and sat are relevant (cats sit on things). The, less so. This selective looking-back is called attention.

For each token, the model computes three little vectors from its embedding:

a Query (Q): "here's what I'm looking for"
a Key (K): "here's what I am, if you're looking for me"
a Value (V): "here's the information I'd contribute if you pay attention to me"

Think of it like a classroom. The current token raises its hand with a question (Q). Every earlier token has a sign on its desk advertising what it knows (K). The student compares its question to all the signs, decides which desks look most relevant, then collects notes (V) from those desks, weighted by how good the match was.

When predicting the next word, the "?" token's Query is compared against every earlier token's Key. The thickness of each line shows how much attention is paid.

The model does this for every token, in parallel, many times over. And it doesn't just do it once per layer: it runs many attention heads side by side, each learning to look for a different kind of pattern. One head might track grammatical subjects; another might match opening and closing brackets; another might follow who-did-what-to-whom across a long paragraph.

Why "Transformer"? One way to read the name: attention transforms each token's embedding by mixing in information from the other tokens it's paying attention to. After one round of attention, the arrow for mat isn't just "mat in general"; it's "mat, in this sentence, after a cat sat on it."
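
The classroom analogy can be sketched in a few lines. This is a single attention head for a single query, with invented 2-number Keys and Values handed in directly; in a real model Q, K, and V are each computed from the embedding by learned matrices, and the vectors are far longer.

```python
import math

def softmax(scores):
    """Turn raw scores into weights between 0 and 1 that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One Query looks back over earlier tokens' Keys and collects their Values."""
    scale = math.sqrt(len(query))
    # How well does the question match each desk's sign?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Collect notes from every desk, weighted by how good the match was.
    width = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, mixed

# Invented numbers for three earlier tokens: "The", "cat", "sat".
keys = [[0.1, 0.0], [2.0, 0.5], [1.5, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 1.0]

weights, mixed = attend(query, keys, values)
print(weights[0] < weights[1])  # True: "The" gets less attention than "cat"
```

The `weights` list is what the line thicknesses in the figure represent, and `mixed` is the information that gets blended into the current token's arrow.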

4. Stacking layers into a deep network

One round of attention is useful. Many rounds is where the power comes from. Modern LLMs stack the attention machinery into a tall tower of layers: often 30, 80, even 100+ of them. Each layer takes the output of the previous one and refines it further.

Information flows up the stack. Early layers handle surface patterns; later layers handle meaning, facts, and style.

Researchers have peeked inside trained models and found roughly this pattern: early layers notice spelling and word boundaries, middle layers handle grammar and sentence structure, and the top layers handle abstract things like topic, tone, and "what kind of answer is this question expecting?" Nobody programmed those layers to specialise that way; the specialisation emerged on its own during training.

5. Predicting the next token

After all those layers, the model has a final, deeply-cooked embedding for the last token in your input. It then compares that embedding against a learned table with one row for every token in the dictionary (in practice, one big matrix multiplication) and gets a score for each token. Higher score = more likely to come next.

Those scores are squashed into probabilities (each between 0 and 1, all summing to 1) by a function called softmax. To pick the next token, the model can either:

Greedy: always pick the most likely token. Reliable, but boring.
Sample: roll dice weighted by the probabilities. Picks the top token most of the time, but occasionally something less obvious. This is what makes outputs feel creative.

The model produces a probability over every possible next token. Picking one and feeding it back in is how it writes a whole response, one token at a time.
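
A minimal sketch of the two picking strategies. The four-token vocabulary and its scores are invented; a real model produces one score for every entry in its dictionary.

```python
import math
import random

def softmax(scores):
    """Squash raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for "The cat sat on the ___".
tokens = ["mat", "sofa", "roof", "banana"]
scores = [3.0, 2.0, 1.0, -1.0]
probs = softmax(scores)

# Greedy: always take the single most likely token.
greedy_choice = tokens[probs.index(max(probs))]
print(greedy_choice)  # mat

# Sample: roll dice weighted by the probabilities (seeded so runs repeat).
rng = random.Random(0)
sampled_choice = rng.choices(tokens, weights=probs, k=1)[0]
print(sampled_choice)
```

Greedy picking gives the same answer every time for the same scores; sampling usually agrees with it but will sometimes return sofa or roof.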

Then comes the loop: append the chosen token to the input, run the whole tower again, get the next token, append, repeat. Stop when the model emits a special end-of-message token. That's how a paragraph appears: not all at once, but one token after another, each one depending on everything that came before.
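
The loop itself is short. In this sketch the "model" is a hypothetical lookup table that only looks at the last token, which is a drastic stand-in for the whole tower; the point is the append-and-repeat structure and the end-of-message check.

```python
END = "<end>"  # assumed name for the special end-of-message token

# Stand-in for the whole tower: maps the last token to the next one.
# A real model looks at every token so far, not just the last.
NEXT = {"The": "cat", "cat": "sat", "sat": "down", "down": END}

def toy_model(tokens):
    return NEXT.get(tokens[-1], END)

def generate(prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)  # run the "tower" on everything so far
        if next_token == END:           # stop at the end-of-message token
            break
        tokens.append(next_token)       # feed the choice back in
    return tokens

print(generate(["The"]))  # ['The', 'cat', 'sat', 'down']
```

The `max_new_tokens` cap is a common safety net in case the end-of-message token never arrives.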

6. The KV cache: not redoing your homework

Here's a problem. If the model writes a 500-token reply, naively it would run the full tower 500 times, and each time it would re-process the entire conversation from the beginning. That's a colossal waste, because the earlier tokens haven't changed.

Remember the Keys and Values from the attention section? For every token at every layer, the model computed a K and a V. Those don't depend on what comes after, because each token only ever looks back at earlier tokens. So once you've computed them, you can store them and reuse them for the rest of the conversation. This stash is called the KV cache.

The KV cache is a grid: one row per layer, one column per token. To add a new token, the model only needs to fill in one new column.

With a KV cache, generating each new token is enormously cheaper than re-processing the whole conversation from scratch. The model computes the K and V for the new token at each layer, slots them into the cache, and uses the new Q to look at all the cached Ks. No re-doing the old work.
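
To see the saving, the sketch below counts how often K and V get computed with and without a cache, for a single layer. The `Projector` class is a hypothetical stand-in for the learned K and V calculations; its arithmetic is arbitrary.

```python
class Projector:
    """Hypothetical stand-in for one layer's learned K and V calculations."""
    def __init__(self):
        self.calls = 0

    def kv(self, embedding):
        self.calls += 1
        key = [2.0 * x for x in embedding]     # arbitrary placeholder maths
        value = [x + 1.0 for x in embedding]
        return key, value

def decode_without_cache(embeddings, projector):
    """At every step, recompute K and V for all tokens seen so far."""
    keys_values = []
    for step in range(1, len(embeddings) + 1):
        keys_values = [projector.kv(e) for e in embeddings[:step]]
    return keys_values

def decode_with_cache(embeddings, projector):
    """At every step, compute K and V only for the newest token and store them."""
    cache = []
    for e in embeddings:
        cache.append(projector.kv(e))
    return cache

embeddings = [[float(i), float(i) / 2] for i in range(5)]
slow, fast = Projector(), Projector()
same_result = decode_without_cache(embeddings, slow) == decode_with_cache(embeddings, fast)
print(same_result, slow.calls, fast.calls)  # True 15 5
```

Both routes end with identical Keys and Values, but the uncached one does 1 + 2 + 3 + 4 + 5 = 15 computations for five tokens while the cached one does 5. The gap widens quickly as the text gets longer.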

The cache isn't free, though: it lives in the GPU's memory, and it grows with every token. A long conversation can have a KV cache of many gigabytes. That's one reason very long chats sometimes get slower or hit length limits: the cache is filling up.
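
A back-of-the-envelope estimate shows where the gigabytes come from. The layer count, vector width, and bytes per number below are assumed round figures, not the specification of any particular model.

```python
def kv_cache_bytes(layers, tokens, width, bytes_per_number=2):
    """One K and one V vector (hence the 2) per token, per layer."""
    return 2 * layers * tokens * width * bytes_per_number

# Assumed example: 80 layers, 8,192 numbers per K or V, 2 bytes per number.
size = kv_cache_bytes(layers=80, tokens=4000, width=8192)
print(size)                 # 10485760000
print(round(size / 1e9, 1)) # 10.5 (gigabytes, for a 4,000-token conversation)
```

Under these assumptions every extra token costs about 2.6 megabytes, so the total scales in a straight line with conversation length.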

7. Prefill vs. decode: two different jobs

When you send a prompt to an LLM, its work splits cleanly into two phases.

Prefill

The model receives your whole input (maybe hundreds or thousands of tokens) and needs to build the KV cache for all of them. The good news: it can process all those tokens in parallel, because they're already known. GPUs love parallel work, so prefill is fast per token, but it's a lot of work because there might be a lot of tokens.

Decode

Then comes the generation phase. The model produces one token at a time: it can't predict token #2 until it has chosen token #1. This is inherently sequential. Each step only adds one column to the KV cache, which sounds tiny, but you have to do hundreds of steps to write a paragraph, and each one requires reading that whole giant cache out of the GPU's memory.

Prefill is one big parallel computation; decode is a long chain of small ones.

Why this matters: the "time to first token" you feel when a chatbot is thinking is mostly prefill. The streaming speed after that is decode. The two phases are so different that some large AI providers run them on separate machines: some optimised for big parallel prefills, others for tight low-latency decoding.
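
The two phases can be sketched as two functions with very different shapes. The cache entries here are placeholders and the reply is scripted in advance, both purely for illustration; a real model stores K and V vectors per layer and chooses each reply token as it goes.

```python
def prefill(prompt_tokens):
    """Prefill: one pass over the whole prompt, filling the cache for every token."""
    # Placeholder entries; a real model stores K and V vectors for each layer.
    return [("kv", token) for token in prompt_tokens]

def decode(cache, scripted_reply):
    """Decode: one pass per new token, each adding a single cache entry."""
    passes = 0
    for token in scripted_reply:  # stand-in for "pick the next token"
        cache.append(("kv", token))
        passes += 1
    return passes

prompt = ["Tell", "me", "a", "joke"]
reply = ["Why", "did", "the", "cat", "sit", "?"]

cache = prefill(prompt)               # 1 pass covering 4 tokens
decode_passes = decode(cache, reply)  # 6 passes covering 1 token each
print(len(cache), decode_passes)      # 10 6
```

One wide pass followed by many narrow ones is the pattern that makes the two phases stress hardware so differently.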

8. How it learned all this

Everything above describes what happens when you use a trained model. But where did all those billions of numbers come from? Training.

Training works like this: take a colossal amount of text (the internet, books, code repositories) totalling trillions of tokens. Show the model a passage with the last token hidden. Ask it to predict that token. Compare its guess to the real answer. Nudge every one of its billions of numbers a tiny bit in the direction that would have made its guess closer to right. Repeat, until the model has worked through all those trillions of tokens.

The basic training loop. The clever bit is the maths (called backpropagation) that decides which way to nudge each parameter.
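
Here is the same loop at toy scale. The "model" is just a table of scores saying how likely each token is to follow each other token, which is an assumption made to keep the sketch short. Because the parameters are the scores themselves, the nudge direction can be written in one line; backpropagation is what works out that direction for every parameter in a deep network.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

VOCAB = ["the", "cat", "sat"]
# Parameters: params[prev][next] is a score. All zeros = uniform guessing.
params = [[0.0] * len(VOCAB) for _ in VOCAB]

def train_step(params, prev_id, target_id, learning_rate=0.5):
    """Predict, compare with the real answer, nudge. Returns how wrong the guess was."""
    probs = softmax(params[prev_id])
    loss = -math.log(probs[target_id])  # big when the right answer got low probability
    for j in range(len(probs)):
        # Nudge direction for these scores: predicted probability minus the truth.
        gradient = probs[j] - (1.0 if j == target_id else 0.0)
        params[prev_id][j] -= learning_rate * gradient
    return loss

text = [0, 1, 2]  # token IDs for "the cat sat"
losses = []
for _ in range(20):
    total = 0.0
    for prev_id, target_id in zip(text, text[1:]):
        total += train_step(params, prev_id, target_id)
    losses.append(total)

print(losses[-1] < losses[0])  # True: the guesses improve with repetition
best_after_the = VOCAB[max(range(3), key=lambda j: params[0][j])]
print(best_after_the)          # cat
```

After twenty passes over a three-token text, the table has learned that cat follows the. Real training does the same thing with billions of parameters and trillions of tokens.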

That's it. That's the whole objective: predict the next token. The astonishing thing is that to get really good at this one boring task, the model has to implicitly learn grammar, facts, arithmetic, programming, reasoning patterns, jokes, and how conversations flow. You can't reliably predict the next word in a Wikipedia article about chess unless you've learned something about chess.

After this base training, modern models go through extra rounds: humans rate which of two answers is better, and the model is nudged towards the preferred style. That's how a raw text-predictor turns into a helpful, polite assistant. But underneath, it's still doing the same thing: one token at a time, guided by attention, accelerated by the KV cache, split into prefill and decode.

Putting it together

Next time you watch a chatbot's reply stream onto the screen, here's what's actually happening:

Your message is chopped into tokens.
Each token is converted into an embedding: a long arrow in meaning-space.
During prefill, all those embeddings flow up through dozens of layers. At each layer, attention mixes information between tokens, with each token's Query checking against the Keys of the tokens before it, then collecting weighted Values. The resulting Ks and Vs get saved into the KV cache.
During decode, the top layer produces a probability for every possible next token. One gets picked. Its K and V at every layer get appended to the cache. Then the model uses that new token's Query to attend over the whole cache and produce the next token. And the next. And the next.
Eventually the model emits an end-of-message token, and the reply is done.

All of that, built on one trick: take a giant pile of numbers, and nudge them, over and over, until they're really good at guessing the next token.

Some details are intentionally simplified. In real models, attention has many heads per layer, each layer also pairs its attention with a feed-forward neural network (the "MLP"), and there's a lot of normalisation keeping the numbers well-behaved. The shape of the story, though, is exactly this.