GPUs
A graphics chip that learned to multiply matrices, became the engine of modern AI, and is now the most fought-over piece of silicon on the planet.
A GPU (Graphics Processing Unit) is a chip originally built to colour in pixels for video games. It turned out the maths it was good at, doing the same arithmetic on huge numbers of numbers at once, is exactly the maths that scientific simulation, image processing, and (later) neural networks need. So a piece of hardware designed to shade triangles ended up powering ChatGPT.
This page is about what's inside a GPU, how it's wired up differently from a CPU, and why that shape of hardware suits modern AI so well.
1. What a GPU actually is
Physically, a GPU is a single chip mounted on a circuit board, surrounded by its own memory (often 24, 80, or 192 GB of it), with a high-bandwidth link (usually PCIe, sometimes NVLink) connecting it to the rest of the computer. Inside that chip are thousands of simple arithmetic units, organised into groups, all able to run at the same time.
Where a CPU is built to run a few independent things very quickly and cleverly, a GPU is built to run thousands of nearly-identical things in lockstep. If your problem looks like "apply this little calculation to a million data points," a GPU will eat it for breakfast. If your problem looks like "do one tricky thing that depends on the last tricky thing," a GPU is the wrong tool.
2. GPU vs. CPU: many small, few big
The deepest difference between a CPU and a GPU is the trade-off they make between per-thread cleverness and how many threads run at once.
A CPU core is a fiendishly sophisticated piece of engineering. It looks dozens of instructions ahead, guesses which branch a program will take, executes things out of order, and undoes them if it guessed wrong. All that machinery exists to make a single sequential program run as fast as physically possible. A typical CPU has 8 to 64 of these heavyweight cores.
A GPU does the opposite. Each of its arithmetic units is small and simple. It doesn't predict branches; it doesn't reorder instructions. But there are thousands of them on one chip, and they're built to do the same thing as each other in step. If your work is uniform enough to keep them all busy, a GPU pulls ahead by sheer headcount.
This is sometimes summed up as: a CPU is a few professors, a GPU is thousands of schoolchildren. If you need someone to write a novel, the professors win. If you need someone to mark a million multiple-choice tests, the schoolchildren win.
3. SIMT: one instruction, lots of data
Those thousands of GPU cores don't each fetch their own instructions independently. That would be a waste of silicon. Instead, the GPU rounds up a group of threads, hands them one instruction, and tells them all to execute it at the same time, each on their own piece of data. NVIDIA calls this model SIMT: Single Instruction, Multiple Threads.
On NVIDIA hardware, the unit of lockstep execution is a warp of 32 threads. The chip fetches one instruction, decodes it once, then 32 threads do that same thing on 32 different inputs in parallel.
This is why GPU programmers care about branch divergence. If your code says if (x > 0) do_thing_a else do_thing_b and half the warp takes each branch, the hardware has no choice but to run them one after the other, with the off-branch threads sitting idle. You've lost half your throughput. Code that does the same thing on every element of an array doesn't have this problem.
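To make the cost of divergence concrete, here is a minimal Python sketch of a warp stepping through that if/else. It is a toy model, not how the hardware is implemented: the names `issue_passes` and `lane_utilisation` are made up for illustration, and it assumes both branches take the same amount of time.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA hardware


def issue_passes(xs):
    """Count the passes a warp needs for `if (x > 0) A else B`.

    Each branch taken by at least one lane costs one pass; lanes on the
    other branch are masked off (idle) during that pass.
    """
    assert len(xs) == WARP_SIZE
    takes_a = [x > 0 for x in xs]   # per-lane predicate, i.e. the active mask
    passes = 0
    if any(takes_a):
        passes += 1                 # lanes with x > 0 run branch A
    if not all(takes_a):
        passes += 1                 # the remaining lanes run branch B
    return passes


def lane_utilisation(xs):
    """Fraction of lane-slots doing useful work across all passes."""
    # Every lane does useful work in exactly one pass.
    return WARP_SIZE / (issue_passes(xs) * WARP_SIZE)


uniform = [1.0] * WARP_SIZE            # every lane takes branch A
split = [1.0] * 16 + [-1.0] * 16       # half the warp takes each branch

print(issue_passes(uniform), lane_utilisation(uniform))
print(issue_passes(split), lane_utilisation(split))
```

The uniform warp needs one pass and the split warp needs two, which is the "lost half your throughput" case described above.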
4. Inside the chip: SMs, warps, threads
Zoom in on a modern NVIDIA GPU and you'll see it's not one giant blob of cores. It's a grid of around 100 to 150 little processors called Streaming Multiprocessors, or SMs. Each SM has its own registers, its own scratchpad memory, its own schedulers, and slots for dozens of resident warps (up to 64 on recent data-centre parts).
The hierarchy goes:
| Level | What it is | Rough count |
|---|---|---|
| Thread | One lane's worth of work; one element of the data | hundreds of thousands resident at once |
| Warp | 32 threads that step through the same instruction stream | thousands |
| Block | A group of warps that shares fast scratchpad memory and can cooperate | hundreds |
| SM | The physical processor that runs blocks | 100–150 per chip |
| GPU | The whole chip | 1 |
An SM keeps far more warps "in flight" than it can actually run on any single clock tick. When one warp stalls waiting for memory, the scheduler switches to another at essentially no cost, because every resident warp keeps its registers on the SM. This latency hiding is how the chip stays busy: while some threads are waiting for data from off-chip memory, others are computing.
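A back-of-the-envelope model shows why so many resident warps are needed. The sketch below assumes each warp alternates between a short burst of arithmetic and a long memory stall, and that the SM can issue for one warp at a time. The function names and the cycle counts (10 cycles of work, 400 cycles of stall) are illustrative assumptions, not measurements.

```python
def sm_busy_fraction(resident_warps, compute_cycles, stall_cycles):
    """Fraction of cycles the SM has a warp ready to issue (capped at 1)."""
    per_warp = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, resident_warps * per_warp)


def warps_needed(compute_cycles, stall_cycles):
    """Smallest number of warps that fully covers the stall (ceiling division)."""
    return -(-(compute_cycles + stall_cycles) // compute_cycles)


# Illustrative numbers: 10 cycles of maths, then a 400-cycle wait on memory.
print(sm_busy_fraction(1, 10, 400))    # one warp alone leaves the SM almost idle
print(warps_needed(10, 400))           # warps required to hide the stall
print(sm_busy_fraction(64, 10, 400))   # a full complement of warps keeps it busy
```

Under these assumptions a single warp keeps the SM busy only a few percent of the time, while a few dozen warps are enough to cover the stall completely.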
5. The memory hierarchy
A GPU has multiple kinds of memory, arranged in a pyramid: small and fast near the cores, big and slow far away. Anyone writing GPU code spends most of their effort moving data up this pyramid before computing on it.
| Memory | Where | Size | Latency (approx.) |
|---|---|---|---|
| Registers | Per thread, in the SM's register file | tiny (up to a few hundred values per thread) | ~1 cycle |
| Shared memory / L1 | Per SM | ~200 KB per SM | tens of cycles |
| L2 cache | On chip, shared by all SMs | ~50–100 MB | a couple of hundred cycles |
| HBM (VRAM) | Stacks of chips next to the GPU die | tens to hundreds of GB | several hundred cycles |
| Host RAM | The CPU's memory, across PCIe / NVLink | ~TB | microseconds |
Modern GPUs also have tensor cores: special units inside each SM that, rather than adding or multiplying single numbers, multiply small matrices in one shot. NVIDIA's first tensor cores (in Volta) performed a 4×4 matrix multiply-and-accumulate per clock; later generations work on larger tiles. That's the trick that makes AI workloads so fast on these chips: most of what a neural network does, at the bottom, is multiplying matrices.
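For reference, this is the operation a 4×4 multiply-and-accumulate performs, written out as plain Python loops. It only illustrates the arithmetic (D = A·B + C); the function name `mma_4x4` is made up here and says nothing about how the hardware or any library exposes it.

```python
def mma_4x4(a, b, c):
    """Return D = A @ B + C for 4x4 matrices given as lists of rows."""
    n = 4
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

# A @ I + 1: every entry of A, plus one.
d = mma_4x4(a, identity, ones)
for row in d:
    print(row)
```

Done one scalar at a time, that is 64 multiplies and 64 adds. The point of a tensor core is that the whole tile is handled as one unit of work.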
6. Why bandwidth, not flops, is usually the limit
Marketing brochures love to quote a GPU's peak number of arithmetic operations per second: teraflops, petaflops, and so on. In practice, you almost never get those numbers, because the cores spend most of their time waiting for data to arrive from HBM.
An H100, for example, can do roughly 1,000 trillion 16-bit operations per second. But its HBM can only deliver about 3 trillion bytes per second. That works out to more than 300 operations for every byte delivered, so if every value you operate on has to be freshly loaded from memory, the chip will sit idle most of the time.
The figure of merit is arithmetic intensity: how many calculations you do per byte of memory you read. Operations with high intensity (matrix multiplication, big convolutions) make the GPU happy. Operations with low intensity (adding two vectors, applying a simple function elementwise) are memory-bound: they finish only as fast as data can arrive.
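The usual way to put numbers on this is a roofline estimate: achievable throughput is the smaller of the peak arithmetic rate and intensity times memory bandwidth. The sketch below uses the rounded H100 figures quoted above and counts bytes written as well as bytes read. The function names are illustrative, and the matmul estimate assumes the ideal case where each matrix crosses the memory bus exactly once.

```python
PEAK_OPS = 1000e12   # ~1,000 trillion 16-bit ops/s (rounded figure from the text)
HBM_BYTES = 3e12     # ~3 trillion bytes/s (rounded figure from the text)


def attainable_ops(intensity):
    """Roofline bound: ops/s achievable at a given ops-per-byte intensity."""
    return min(PEAK_OPS, intensity * HBM_BYTES)


def vector_add_intensity(bytes_per_value=2):
    # c[i] = a[i] + b[i]: one add for two values read and one written.
    return 1 / (3 * bytes_per_value)


def matmul_intensity(n, bytes_per_value=2):
    # n x n times n x n: 2*n^3 ops over three n x n matrices moved once each.
    return (2 * n**3) / (3 * n * n * bytes_per_value)


ridge = PEAK_OPS / HBM_BYTES   # intensity at which the chip becomes compute-bound

print(f"ridge point: {ridge:.0f} ops/byte")
print(f"vector add:  {attainable_ops(vector_add_intensity()) / PEAK_OPS:.4%} of peak")
print(f"4096 matmul: {attainable_ops(matmul_intensity(4096)) / PEAK_OPS:.0%} of peak")
```

With these numbers a 16-bit vector add is capped at a small fraction of one percent of peak, while a large matmul sits comfortably above the ridge point.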
7. Kernels and CUDA
A program for the GPU is called a kernel. A kernel is a single function, written in a GPU-aware language, that the host (the CPU side) launches across a grid of thread blocks. Each thread runs the same function body, but with a different built-in index, and uses that index to decide which piece of data it's responsible for.
The simplest possible kernel: add two arrays elementwise.
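
    // Each thread adds one pair of elements.
    __global__ void add(const float* a, const float* b, float* c, int n) {
        // Global index: which block we are in, times block size, plus our slot in the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // The last block may overhang the array, so guard the access.
        if (i < n) c[i] = a[i] + b[i];
    }

Every thread gets a different i from those built-in variables. With one million threads, that one kernel adds two arrays of a million elements each in essentially one go.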
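To see how the index arithmetic and the bounds check fit together, here is a sequential Python emulation of launching that kernel. It is only a sketch: a real launch runs the threads in parallel, and `launch_add` and the block size of 256 are assumptions chosen for illustration.

```python
def launch_add(a, b, threads_per_block=256):
    """Emulate the add kernel one 'thread' at a time; return (c, number of blocks)."""
    n = len(a)
    c = [0.0] * n
    # Round up so the grid covers all n elements.
    blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(blocks):                    # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):    # plays the role of threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:                                  # same guard as the kernel
                c[i] = a[i] + b[i]
    return c, blocks


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
c, blocks = launch_add(a, b)
print(blocks, blocks * 256 - len(a))   # blocks launched, threads with nothing to do
```

With 1,000 elements and 256 threads per block the launch needs 4 blocks, so 24 threads in the last block fail the `i < n` test and do nothing. That is why the guard is there.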
This is the CUDA programming model, NVIDIA's name for the C-like language and runtime that lets you write kernels and launch them. AMD has a near-clone called HIP / ROCm; Apple has Metal; there's also OpenCL and SYCL. CUDA dominates in practice because NVIDIA shipped it earliest, kept extending it, and built a vast library ecosystem (cuBLAS for linear algebra, cuDNN for neural network primitives, cuFFT for Fourier transforms, etc.) that almost all AI software now sits on top of.
Most people writing AI code today never touch CUDA directly. They write PyTorch or JAX, which under the hood turns each operation into a GPU kernel, either a pre-written one from a library or one generated by a compiler. The Python you write is a thin orchestrator; the heavy lifting happens in compiled kernels you never see.
8. How GPUs ended up running AI
Neural networks are mostly matrix multiplications and a few simple elementwise functions on top. A forward pass through a transformer layer, for instance, is dominated by a handful of large matmuls: the attention projections, the feed-forward block, and the output projection. Same story for the backward pass during training.
Matrix multiplication is the textbook example of a high-intensity, embarrassingly parallel workload. Every output element is the sum of a row times a column; the rows and columns can be loaded once and reused across many outputs; the work splits naturally across thousands of threads. This is precisely what GPUs are best at, and what tensor cores are tuned for.
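The "loaded once and reused" part is what tiling exploits. The sketch below is a plain-Python blocked matmul that counts how many values it pulls from "slow" memory when each block stages a tile of each input before computing. The function `tiled_matmul` and its load counter are an illustrative model, not how any particular library implements it.

```python
def tiled_matmul(a, b, tile):
    """Blocked n x n matmul; returns (c, values fetched from slow memory)."""
    n = len(a)
    assert n % tile == 0
    c = [[0.0] * n for _ in range(n)]
    slow_loads = 0
    for bi in range(0, n, tile):            # one (bi, bj) pair per output tile
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):    # walk along the shared dimension
                # Stage one tile of A and one tile of B in "fast" memory.
                a_tile = [row[bk:bk + tile] for row in a[bi:bi + tile]]
                b_tile = [row[bj:bj + tile] for row in b[bk:bk + tile]]
                slow_loads += 2 * tile * tile
                # Every staged value is reused `tile` times in this inner product.
                for i in range(tile):
                    for j in range(tile):
                        c[bi + i][bj + j] += sum(
                            a_tile[i][k] * b_tile[k][j] for k in range(tile)
                        )
    return c, slow_loads


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

_, naive_loads = tiled_matmul(a, eye, tile=1)   # no reuse at all
c, tiled_loads = tiled_matmul(a, eye, tile=4)
print(naive_loads, tiled_loads)
```

The total comes out to 2n³ divided by the tile size, so a 4×4 tile cuts slow-memory traffic by a factor of four here, and larger tiles cut it further.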
The chain of events looked roughly like:
- 2007: NVIDIA releases CUDA. Scientific computing people start using GPUs for non-graphics work.
- 2012: AlexNet wins ImageNet by a wide margin, trained on two consumer GPUs. The deep learning revolution begins.
- 2017: The Transformer paper. The architecture turns out to scale beautifully with more GPUs and more data.
- 2017–2020: NVIDIA adds tensor cores (first in the Volta V100, 2017); PyTorch and TensorFlow standardise on CUDA.
- 2022–now: LLMs eat the world. GPU clusters become strategic national infrastructure; the H100 sells for the price of a luxury car; AMD and the hyperscalers (Google's TPU, Amazon's Trainium, Apple's silicon, Meta's MTIA) all rush to build alternatives.
None of this was inevitable. Researchers in the 1990s had neural networks; they just didn't have hardware fast enough to train interesting ones. The GPU showed up, almost by accident, as the right shape of chip at the right moment. That's why every AI cluster you read about (xAI's Colossus, OpenAI's Stargate, Meta's RSC) is, fundamentally, a warehouse full of GPUs wired together.
9. Try it: shape a workload
The interactive widget below is a toy model of how a workload's shape changes whether the GPU is happy. Adjust the sliders and watch how utilisation and time-per-step shift.
Some patterns worth poking at:
- Batch size = 1, low work. Tiny batch, tiny per-sample work: the GPU sits mostly idle. This is what naive single-sample inference looks like.
- Crank batch size up. Utilisation climbs, step time grows much more slowly than batch size. This is why inference servers batch requests.
- High memory pressure. Bandwidth bar pegs at 100%; utilisation stalls no matter how high you push compute. You've gone memory-bound.
- Add branch divergence. Utilisation drops even when nothing else changes; warps are serialising their paths.
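The batching pattern can also be reproduced in a few lines of arithmetic. This is not the widget's model, just a rough sketch under loud assumptions: a hypothetical model with 14 GB of 16-bit weights and 14 billion operations per sample, weights streamed from HBM once per step, the rounded H100 figures from earlier, and no launch overhead or activation traffic.

```python
PEAK_OPS = 1000e12   # ops/s, rounded H100 figure from earlier
BANDWIDTH = 3e12     # bytes/s, rounded H100 figure from earlier

WEIGHT_BYTES = 14e9      # hypothetical model: 14 GB of weights
OPS_PER_SAMPLE = 14e9    # hypothetical: 14 billion operations per sample


def step_time(batch):
    """One step takes as long as the slower of memory traffic and arithmetic."""
    memory_time = WEIGHT_BYTES / BANDWIDTH            # weights read once per step
    compute_time = batch * OPS_PER_SAMPLE / PEAK_OPS  # grows with batch size
    return max(memory_time, compute_time)


def utilisation(batch):
    """Fraction of the step during which the arithmetic units are busy."""
    return (batch * OPS_PER_SAMPLE / PEAK_OPS) / step_time(batch)


for batch in (1, 64, 512):
    print(batch, f"{step_time(batch) * 1e3:.2f} ms", f"{utilisation(batch):.1%}")
```

In this toy model, going from batch 1 to batch 64 costs no extra time at all, because the step is limited by reading the weights. Only at much larger batches does arithmetic become the limit.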
Recap
A GPU is a chip that trades single-thread cleverness for sheer parallel headcount. Its thousands of small cores are organised into SMs, which run warps of 32 threads in lockstep. Its memory comes in layers: tiny and fast on chip, large and slow off chip, with most of the engineering effort going into not waiting for it.
That shape, lots of identical arithmetic on contiguous data, happens to be exactly the shape of matrix multiplication, which happens to be most of what a neural network does. Add a software stack (CUDA, cuDNN, PyTorch) and a couple of decades of accumulated tooling, and you get the chip that defines the current era of computing.
Numbers and rules of thumb are approximate and lean on NVIDIA's data-centre GPUs (the A100 / H100 / B200 generations). Other vendors arrange the same basic ideas with different names: AMD calls warps "wavefronts" (64 threads wide on its data-centre chips), Apple calls SMs "GPU cores," and so on. The shape of the story is the same.