I came to graphics drivers from embedded. At Baykar I did board bring-up on a TI AM67A — writing device tree, setting up DDR, poking registers directly. That background made the graphics stack look familiar in some ways and strange in others.
When I started reading Mesa, I needed a mental model for where everything fit.
The model I settled on has three layers, ignoring firmware: the kernel driver, the userspace driver, and the shader compiler.
The kernel driver does what only the kernel can do — memory management,
command submission from multiple processes, driving the display controller, power
management, GPU reset when things go wrong. On Linux this lives in the DRM subsystem. For AMD it's amdgpu.
The userspace driver implements the API — Vulkan, OpenGL, whatever.
This is where most of the logic lives. For AMD Vulkan it's RADV, which is part of Mesa.
It records command buffers, manages pipeline state, uploads resources. When you call
vkQueueSubmit, it eventually does an ioctl into the kernel driver to actually
submit work.
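To make that boundary concrete, here is a sketch of what the submission ioctl looks like from userspace. The struct and ioctl number below are made up for illustration; the real interface is the DRM_IOCTL_AMDGPU_CS ioctl and its argument structures in the kernel's amdgpu_drm.h, which carry chunks describing indirect buffers, fences, and dependencies.

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Simplified stand-in for the real submission arguments.
 * (Hypothetical — see DRM_IOCTL_AMDGPU_CS in amdgpu_drm.h
 * for the actual layout.) */
struct fake_cs_args {
    uint64_t ib_gpu_addr;  /* GPU address of the recorded command buffer */
    uint32_t ib_size_dw;   /* its size in dwords */
    uint32_t ctx_id;       /* which submission context it belongs to */
};

/* Hypothetical ioctl number, for illustration only. */
#define FAKE_IOCTL_CS _IOWR('d', 0x44, struct fake_cs_args)

/* What a queue-submit path boils down to: the userspace driver fills
 * in a struct describing work it already recorded, then makes one
 * ioctl on the DRM file descriptor. Returns 0 on success, -1 on error. */
int submit_cs(int drm_fd, struct fake_cs_args *args)
{
    return ioctl(drm_fd, FAKE_IOCTL_CS, args);
}
```

The point of the shape, not the names: all the expensive work (recording, state setup, compilation) happened in userspace beforehand, and the kernel crossing is one cheap call per submission.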
The shader compiler takes bytecode (SPIR-V for Vulkan) and produces GPU machine code. GPU ISAs aren't standardized — gfx9 and gfx11 are different enough that you can't run the same binary on both. So compilation happens at runtime, on your machine, for your specific GPU. For RADV this is ACO.
Coming from embedded, this surprised me at first. On a microcontroller you cross-compile once and flash the binary. GPUs don't work that way. Every generation has a different ISA, different register counts, different instruction latencies. Even same-generation cards differ in shader core count, cache sizes, memory bandwidth — all of which affect how a good compiler schedules instructions.
SPIR-V is the portable format; the driver compiles it on first use. That's the "compiling shaders" stutter in new games: the driver compiling every shader the first time it sees it.
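The portable format is concrete enough to inspect by hand. Every SPIR-V module starts with a five-word header whose first word is the magic number 0x07230203; the sketch below is a minimal sanity check of the kind a consumer might do before parsing (the function name is mine, not from any real driver).

```c
#include <stdint.h>
#include <stddef.h>

/* Every SPIR-V module begins with this magic number. Consumers can
 * also use it to detect byte order: if the bytes arrive swapped, the
 * module was produced on the other endianness. */
#define SPIRV_MAGIC 0x07230203u

/* Cheap structural check: a SPIR-V header is 5 words
 * (magic, version, generator, bound, schema). */
int looks_like_spirv(const uint32_t *words, size_t nwords)
{
    if (nwords < 5)
        return 0;
    return words[0] == SPIRV_MAGIC;
}
```

Everything after that header is ISA-agnostic bytecode, which is exactly why the driver can defer the machine-specific part of compilation to your machine.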
ACO is the compiler backend inside Mesa for AMD GPUs. It takes NIR (Mesa's IR) and produces AMD machine code. It handles instruction selection, register allocation, and scheduling.
Register allocation on a GPU is different from on a CPU. A GPU hides a memory stall by switching to another wavefront, so you want many wavefronts resident. But every resident wavefront's registers come from one shared register file: the more registers a shader uses, the fewer wavefronts fit, and the less parallelism there is to hide stalls with. Using more registers can also make the individual shader faster, by keeping values live instead of reloading them. It's a tradeoff the compiler has to balance per shader.
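The tradeoff can be put in numbers. Using GCN-era figures — a register file worth 256 VGPRs per wavefront slot on each SIMD, and at most 10 resident wavefronts — occupancy falls off in steps as register use grows (real hardware also rounds allocations to a granule, which this sketch ignores):

```c
/* Rough occupancy model with GCN-era numbers (simplified):
 * each SIMD's register file holds 256 VGPRs' worth of state per
 * wavefront slot, and the SIMD can track at most 10 wavefronts.
 * More VGPRs per shader => fewer concurrent wavefronts. */
#define VGPRS_PER_SIMD 256
#define MAX_WAVES      10

int waves_per_simd(int vgprs_used)
{
    if (vgprs_used <= 0)
        return MAX_WAVES;
    int waves = VGPRS_PER_SIMD / vgprs_used;
    return waves > MAX_WAVES ? MAX_WAVES : waves;
}
```

Under this model a shader using 128 VGPRs runs two waves per SIMD, while trimming it to 85 buys a third — which is why the register allocator can't just grab registers greedily; a handful of extra live values can cost a whole wavefront of latency hiding.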
This is the part of the stack I'm working on. Before getting here I needed to understand everything above it — what RADV passes into ACO, what the kernel expects to receive, how it all connects.