A GPU (Graphics Processing Unit) is a chip originally built to colour in pixels for video games. It turned out the maths it was good at, doing the same arithmetic on huge numbers of numbers at once, is exactly the maths that scientific simulation, image processing, and (later) neural networks need. So a piece of hardware designed to shade triangles ended up powering ChatGPT.

This page is about what's inside a GPU, how it's wired up differently from a CPU, and why that shape of hardware suits modern AI so well.

1. What a GPU actually is

Physically, a GPU is a single chip mounted on a circuit board, surrounded by its own memory (often 24, 80, or 192 GB of it), with a fat link (PCIe or NVLink) connecting it to the rest of the computer. Inside that chip are thousands of simple arithmetic units, organised into groups, all able to run at the same time.

Where a CPU is built to run a few independent things very quickly and cleverly, a GPU is built to run thousands of nearly-identical things in lockstep. If your problem looks like "apply this little calculation to a million data points," a GPU will eat it for breakfast. If your problem looks like "do one tricky thing that depends on the last tricky thing," a GPU is the wrong tool.

[Figure: a GPU board (the thing you slot into a server) carrying the GPU chip (thousands of arithmetic units), four HBM memory stacks, and a PCIe / NVLink connection to the rest of the machine.]
A GPU is a chip and its memory, glued together on a board. Most of the silicon is arithmetic; most of the surface area around it is memory.

2. GPU vs. CPU: many small, few big

The deepest difference between a CPU and a GPU is the trade-off they make between per-thread cleverness and how many threads run at once.

A CPU core is a fiendishly sophisticated piece of engineering. It looks dozens of instructions ahead, guesses which branch a program will take, executes things out of order, and undoes them if it guessed wrong. All that machinery exists to make a single sequential program run as fast as physically possible. A typical CPU has 8 to 64 of these heavyweight cores.

A GPU does the opposite. Each of its arithmetic units is small and simple. It doesn't predict branches; it doesn't reorder instructions. But there are thousands of them on one chip, and they're built to do the same thing as each other in step. If your work is uniform enough to keep them all busy, a GPU pulls ahead by sheer headcount.

[Figure: CPU versus GPU. CPU: few big cores, lots of cache and control (large shared cache, branch prediction, out-of-order execution). GPU: thousands of tiny cores, minimal control, small caches, and a big shared memory bus instead.]
Same chip area, different bargain. CPUs spend transistors on each core being smart; GPUs spend them on having lots of cores.

This is sometimes summed up as: a CPU is a few professors, a GPU is thousands of schoolchildren. If you need someone to write a novel, the professors win. If you need someone to mark a million multiple-choice tests, the schoolchildren win.

3. SIMT: one instruction, lots of data

Those thousands of GPU cores don't each fetch their own instructions independently. That would be a waste of silicon. Instead, the GPU rounds up a group of threads, hands them one instruction, and tells them all to execute it at the same time, each on their own piece of data. NVIDIA calls this model SIMT: Single Instruction, Multiple Threads.

On NVIDIA hardware, the unit of lockstep execution is a warp of 32 threads. The chip fetches one instruction, decodes it once, then 32 threads do that same thing on 32 different inputs in parallel.

[Figure: one instruction, c[i] = a[i] + b[i], executed by the 32 threads of one warp, each with its own i (i=0 to i=31). All 32 add a[i]+b[i] simultaneously and write c[i] simultaneously. Inset on branch divergence: if some threads take an "if" branch and others don't, the warp serialises, running the true-branch threads while the false-branch threads idle, then vice versa.]
32 threads, one instruction. Great when they all agree; expensive when they disagree.

This is why GPU programmers care about branch divergence. If your code says if (x > 0) do_thing_a else do_thing_b and half the warp takes each branch, the hardware has no choice but to run them one after the other, with the off-branch threads sitting idle. You've lost half your throughput. Code that does the same thing on every element of an array doesn't have this problem.
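
To put a number on that, here is a minimal Python sketch that counts how many serial passes one warp needs for a two-way branch. It is a toy model, not a description of how the hardware is built: the names warp_passes and lane_efficiency are made up for illustration, and it assumes both branches cost the same.

```python
WARP_SIZE = 32  # NVIDIA warp width, as described above


def warp_passes(xs):
    """Count the serial passes one warp needs for: if x > 0: A else: B."""
    assert len(xs) == WARP_SIZE
    takes_a = [x > 0 for x in xs]   # per-thread branch decision (the "mask")
    passes = 0
    if any(takes_a):                # at least one thread needs branch A
        passes += 1
    if not all(takes_a):            # at least one thread needs branch B
        passes += 1
    return passes


def lane_efficiency(xs):
    """Fraction of lane-slots doing useful work (toy model: equal-cost branches)."""
    return 1 / warp_passes(xs)


uniform = [1.0] * WARP_SIZE               # every thread takes branch A
mixed = [1.0, -1.0] * (WARP_SIZE // 2)    # threads alternate between A and B

print(warp_passes(uniform), lane_efficiency(uniform))  # 1 1.0
print(warp_passes(mixed), lane_efficiency(mixed))      # 2 0.5
```

Under this model a 31-to-1 split costs as much as a 16-to-16 split: one stray thread is enough to force the second pass.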

4. Inside the chip: SMs, warps, threads

Zoom in on a modern NVIDIA GPU and you'll see it's not one giant blob of cores. It's a grid of around 100 to 150 little processors called Streaming Multiprocessors, or SMs. Each SM has its own registers, its own scratchpad memory, its own scheduler, and slots for dozens of warps.

The hierarchy goes:

| Level | What it is | Rough count |
| --- | --- | --- |
| Thread | One arithmetic unit's worth of work; one value of i | tens to hundreds of thousands live at once |
| Warp | 32 threads that step through the same instruction stream | thousands |
| Block | Group of warps that share fast scratchpad memory and can cooperate | hundreds |
| SM | The physical processor that runs blocks | 100–150 per chip |
| GPU | The whole chip | 1 |

[Figure: the GPU chip as a grid of many SMs, with a zoom into one SM: a block of cooperating warps sharing a scratchpad, plus the SM's resources (register file, shared memory, L1 cache, warp scheduler, tensor cores).]
A GPU is a grid of SMs; an SM runs blocks of warps; a warp is 32 threads in lockstep.
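
As a rough sketch of how the levels nest, the Python below splits a flat thread number into block, warp, and lane. The helper name locate and the launch shape (256 threads per block) are hypothetical. The chip figures (132 SMs, 64 resident warps per SM) are assumptions modelled on a recent NVIDIA data-centre part, not numbers taken from the text.

```python
WARP_SIZE = 32


def locate(global_thread, threads_per_block):
    """Split a flat thread index into (block, warp within block, lane within warp)."""
    block, thread_in_block = divmod(global_thread, threads_per_block)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane


# Hypothetical launch shape: 256 threads per block, i.e. 8 warps per block.
print(locate(0, 256))     # (0, 0, 0)
print(locate(300, 256))   # (1, 1, 12)

# Assumed chip: 132 SMs, each able to keep 64 warps resident.
resident_threads = 132 * 64 * WARP_SIZE
print(resident_threads)   # 270336
```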

An SM keeps far more warps "in flight" than it can actually run on any single clock tick. When one warp stalls waiting for memory, the scheduler instantly switches to another. This latency hiding is how the chip stays busy: while some threads are waiting for data from the far edge of the chip, others are computing.
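
A toy simulation makes the effect visible. The sketch below assumes a deliberately crude model (each warp issues one instruction, then stalls for a fixed memory latency, and only one warp can issue per cycle); the function name utilisation and the 99-cycle latency are illustrative choices, not measured values.

```python
def utilisation(num_warps, mem_latency, cycles=10_000):
    """Fraction of cycles in which the scheduler found a ready warp to issue."""
    ready_at = [0] * num_warps   # cycle at which each warp can next issue
    busy = 0
    for now in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= now:
                # Issue one instruction, then this warp stalls on memory.
                ready_at[w] = now + 1 + mem_latency
                busy += 1
                break            # only one warp issues per cycle
    return busy / cycles


for warps in (1, 50, 100):
    print(warps, utilisation(warps, mem_latency=99))
# 1 0.01
# 50 0.5
# 100 1.0
```

With one warp the scheduler is idle 99% of the time. With enough warps to cover the stall, it never runs out of work, even though no single warp got any faster.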

5. The memory hierarchy

A GPU has multiple kinds of memory, arranged in a pyramid: small and fast near the cores, big and slow far away. Anyone writing GPU code spends most of their effort moving data up this pyramid before computing on it.

| Memory | Where | Size | Speed |
| --- | --- | --- | --- |
| Registers | Private to each thread | tiny (hundreds of values) | effectively instant |
| Shared memory / L1 | Per SM | ~100 KB per SM | tens of cycles |
| L2 cache | On chip, shared | ~50–100 MB | a couple of hundred cycles |
| HBM (VRAM) | Stacks of chips next to the GPU die | tens to hundreds of GB | several hundred cycles |
| Host RAM | The CPU's memory, across PCIe / NVLink | ~TB | microseconds |

[Figure: the memory pyramid, top to bottom: registers (per-thread, fastest, kilobytes total); shared / L1 (per-SM scratchpad, ~100 KB); L2 cache (shared across SMs, tens of MB); HBM / VRAM (on the GPU board, ~80 GB, the big pool); host RAM (over PCIe / NVLink, far away, slow).]
Closer to the cores: tiny and instant. Further away: huge and slow. Good GPU code lives at the top of the pyramid.

Modern GPUs also have tensor cores: special units inside each SM that, instead of adding or multiplying single numbers, multiply small matrices in one shot. A tensor core can do a small matrix multiply-and-accumulate (4×4 on the first generation, larger tiles since) as a single instruction. That's the trick that makes AI workloads so fast on these chips: most of what a neural network does, at the bottom, is multiplying matrices.
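
For a feel of what "multiply-and-accumulate" means, here is the 4×4 case written out in plain Python. This only illustrates the arithmetic (D = A × B + C); the function name mma_4x4 is made up, and real tensor cores work on low-precision hardware formats rather than Python numbers.

```python
def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(4))
             for j in range(4)]
            for i in range(4)]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
a = [[i * 4 + j for j in range(4)] for i in range(4)]   # 0..15 laid out row by row
ones = [[1] * 4 for _ in range(4)]

d = mma_4x4(a, identity, ones)   # A x I + 1, so every entry of A goes up by one
print(d[0])                      # [1, 2, 3, 4]

# One such operation bundles 4 * 4 * 4 = 64 multiply-adds.
multiply_adds = 4 * 4 * 4
```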

6. Why bandwidth, not flops, is usually the limit

Marketing brochures love to quote a GPU's peak number of arithmetic operations per second: teraflops, petaflops, and so on. In practice, you almost never get those numbers, because the cores spend most of their time waiting for data to arrive from HBM.

An H100, for example, can do roughly 1,000 trillion 16-bit operations per second. But its HBM can only deliver about 3 trillion bytes per second. If every value you operate on has to be freshly loaded from memory, the chip will sit idle most of the time.

The figure of merit is arithmetic intensity: how many calculations you do per byte of memory you read. Operations with high intensity (matrix multiplication, big convolutions) make the GPU happy. Operations with low intensity (adding two vectors, applying a simple function elementwise) are memory-bound: they finish only as fast as data can arrive.
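
The arithmetic can be sketched in a few lines of Python, using the rounded H100 figures quoted above. The names PEAK_FLOPS, PEAK_BYTES, and attainable_flops are illustrative, and the vector-add estimate assumes 16-bit values with every operand coming from HBM.

```python
PEAK_FLOPS = 1000e12   # ~1,000 trillion 16-bit operations per second (figure from the text)
PEAK_BYTES = 3e12      # ~3 trillion bytes per second from HBM (figure from the text)


def attainable_flops(intensity):
    """Best-case throughput for a given arithmetic intensity (flops per byte)."""
    return min(PEAK_FLOPS, intensity * PEAK_BYTES)


# Intensity at which the chip stops being limited by memory.
ridge = PEAK_FLOPS / PEAK_BYTES
print(round(ridge))    # 333

# Vector add on 16-bit values: 1 flop per element,
# with 2 loads and 1 store of 2 bytes each, so 6 bytes moved.
vector_add_intensity = 1 / 6
fraction_of_peak = attainable_flops(vector_add_intensity) / PEAK_FLOPS
```

Under these assumptions a plain vector add can reach only about 0.05% of the advertised peak, while anything with more than roughly 333 operations per byte can in principle use all of it.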

[Figure: roofline diagram. Horizontal axis: arithmetic intensity (flops per byte loaded); vertical axis: throughput. Left of the "ridge point" is the memory-bound region (bandwidth limits you); right of it is the compute-bound region (flops limit you). From left to right: vector add, elementwise + activation, matrix multiply, big batched matmul.]
Roofline diagram. Operations to the left of the ridge can't go faster than the memory; only operations to the right see the full flops the brochure promised.

This explains a lot: why "batching" makes inference faster (you reuse the loaded weights across many inputs), why fused operations are faster than separate ones (one trip to memory instead of three), and why memory bandwidth and capacity, not raw flops, are usually what gets argued over when picking a GPU.
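
The batching effect can be estimated with a back-of-envelope model. The sketch below considers one dense layer with a d × d weight matrix in 16-bit values and assumes, optimistically, that the weights are read from memory exactly once per batch. The function name layer_intensity and the size d = 4096 are hypothetical.

```python
def layer_intensity(d, batch, bytes_per_value=2):
    """Flops per byte for a d x d dense layer applied to `batch` input vectors."""
    flops = 2 * d * d * batch                  # one multiply and one add per weight per input
    values_moved = d * d + 2 * d * batch       # weights once, plus input and output activations
    return flops / (bytes_per_value * values_moved)


for batch in (1, 8, 64):
    print(batch, round(layer_intensity(4096, batch), 1))
```

At batch size 1 the intensity is about one flop per byte, which is deep in memory-bound territory. It grows almost linearly with the batch size, because the same weights are reused for every input.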

7. Kernels and CUDA

A program for the GPU is called a kernel. A kernel is a single function, written in a GPU-aware language, that the host (the CPU side) launches across a grid of thread blocks. Each thread runs the same function body but with different built-in index values, and uses them to decide which piece of data it's responsible for.

The simplest possible kernel: add two arrays elementwise.

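// Kernel: each thread adds one pair of elements.
__global__ void add(const float* a, const float* b, float* c, int n) {
    // Global index: our block's number times threads per block, plus our slot in the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The last block may overhang the array, so threads past the end do nothing.
    if (i < n) c[i] = a[i] + b[i];
}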

Every thread gets a different i from those built-in variables. With one million threads, that one kernel adds two arrays of a million elements each in essentially one go.
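
To see how the indices line up without a GPU, the kernel can be mimicked in Python. This sketch runs the threads one after another in a loop, which is the opposite of what a GPU does, so it shows only the indexing. The launch helper and the sizes (1,000 elements, 256 threads per block) are hypothetical.

```python
def add_kernel(block_idx, block_dim, thread_idx, a, b, c, n):
    """Python stand-in for the CUDA kernel: one call plays the role of one thread."""
    i = block_idx * block_dim + thread_idx
    if i < n:                 # threads past the end of the array do nothing
        c[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Call the kernel once per (block, thread) pair. A GPU would run these in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 1000
block_dim = 256
grid_dim = (n + block_dim - 1) // block_dim   # round up: 4 blocks, 1024 threads

a = list(range(n))
b = [10] * n
c = [0] * n
launch(add_kernel, grid_dim, block_dim, a, b, c, n)
print(grid_dim, c[0], c[999])   # 4 10 1009
```

The launch creates 1,024 threads for 1,000 elements, and the bounds check is what keeps the last 24 from writing past the end of the array.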

This is the CUDA programming model, NVIDIA's name for the C-like language and runtime that lets you write kernels and launch them. AMD has a near-clone called HIP / ROCm; Apple has Metal; there's also OpenCL and SYCL. CUDA dominates in practice because NVIDIA shipped it earliest, kept extending it, and built a vast library ecosystem (cuBLAS for linear algebra, cuDNN for neural network primitives, cuFFT for Fourier transforms, etc.) that almost all AI software now sits on top of.

Why is CUDA such a moat? The hardware is hard to copy, but the software stack is harder. Two decades of libraries, two decades of every researcher publishing CUDA code, every framework targeting CUDA first. Rewriting all of that for another vendor's chips is the work of years, and it has to keep up with NVIDIA's moving target. Hence the wisecrack that NVIDIA's real product isn't the GPU, it's CUDA.

Most people writing AI code today never touch CUDA directly. They write PyTorch or JAX, which under the hood dispatches each operation to a GPU kernel, usually one that was written in advance or generated by a compiler. The Python you write is a thin orchestrator; the heavy lifting happens in compiled kernels you never see.

8. How GPUs ended up running AI

Neural networks are mostly matrix multiplications and a few simple elementwise functions on top. A forward pass through a transformer layer, for instance, is dominated by a handful of large matmuls: the attention projections, the feed-forward block, and the output projection. Same story for the backward pass during training.

Matrix multiplication is the textbook example of a high-intensity, embarrassingly parallel workload. Every output element is the dot product of one row and one column; the rows and columns can be loaded once and reused across many outputs; the work splits naturally across thousands of threads. This is precisely what GPUs are best at, and what tensor cores are tuned for.
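
The "loaded once and reused" part is what tiling exploits. The sketch below is a plain-Python illustration, not a real GPU kernel: it multiplies two matrices tile by tile and counts how many values it fetches from the pretend slow memory. The function name matmul_tiled is made up, and it assumes the matrix size is a multiple of the tile size.

```python
def matmul_tiled(a, b, n, tile):
    """Multiply two n x n matrices tile by tile, counting values fetched from slow memory."""
    c = [[0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B in fast memory (the scratchpad).
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                # Each staged value is now reused `tile` times.
                for i in range(tile):
                    for j in range(tile):
                        c[i0 + i][j0 + j] += sum(
                            a_tile[i][k] * b_tile[k][j] for k in range(tile))
    return c, loads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]

naive, naive_loads = matmul_tiled(a, b, n, tile=1)   # tile=1 reloads operands for every multiply
tiled, tiled_loads = matmul_tiled(a, b, n, tile=4)
print(naive == tiled)              # True
print(naive_loads, tiled_loads)    # 1024 256
```

The result and the arithmetic are the same, but the 4×4 tiles cut the memory traffic by a factor of four. Bigger tiles cut it further, for as long as they fit in fast memory.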

The chain of events looked roughly like:

  1. 2007: NVIDIA releases CUDA. Scientific computing people start using GPUs for non-graphics work.
  2. 2012: AlexNet wins ImageNet by a wide margin, trained on two consumer GPUs. The deep learning revolution begins.
  3. 2017: The Transformer paper. The architecture turns out to scale beautifully with more GPUs and more data.
  4. 2017–2020: NVIDIA adds tensor cores (first in the Volta V100); PyTorch and TensorFlow standardise on CUDA.
  5. 2022–now: LLMs eat the world. GPU clusters become strategic national infrastructure; the H100 sells for the price of a luxury car; AMD and the hyperscalers (Google's TPU, Amazon's Trainium, Apple's silicon, Meta's MTIA) all rush to build alternatives.

None of this was inevitable. Researchers in the 1990s had neural networks; they just didn't have hardware fast enough to train interesting ones. The GPU showed up, almost by accident, as the right shape of chip at the right moment. That's why every AI cluster you read about (xAI's Colossus, OpenAI's Stargate, Meta's RSC) is, fundamentally, a warehouse full of GPUs wired together.

9. Try it: shape a workload

The interactive widget below is a toy model of how a workload's shape changes whether the GPU is happy. Adjust the sliders and watch how utilisation and time-per-step shift.

[Interactive widget: sliders for the workload's shape, with readouts for GPU utilisation, memory bandwidth used, and step time.]

Recap

A GPU is a chip that trades single-thread cleverness for sheer parallel headcount. Its thousands of small cores are organised into SMs, which run warps of 32 threads in lockstep. Its memory comes in layers: tiny and fast on chip, large and slow off chip, with most of the engineering effort going into not waiting for it.

That shape, lots of identical arithmetic on contiguous data, happens to be exactly the shape of matrix multiplication, which happens to be most of what a neural network does. Add a software stack (CUDA, cuDNN, PyTorch) and a couple of decades of accumulated tooling, and you get the chip that defines the current era of computing.

Numbers and rules of thumb are approximate and lean on NVIDIA's data-centre GPUs (A100 / H100 / B200 generation). Other vendors arrange the same basic ideas with different names: AMD calls warps "wavefronts" (64 threads wide on its data-centre GPUs), Apple calls SMs "GPU cores," and so on. The shape of the story is the same.