Understanding Streaming Multiprocessors in NVIDIA GPUs

2025-09-30 • ~15 min read

Streaming Multiprocessors (SMs) in NVIDIA GPUs act like the cores in CPUs, handling computations and holding state in registers with caches nearby. But unlike beefy CPU cores, SMs keep things simple and straightforward—no fancy speculation or branch prediction here. Instead, they shine by juggling tons of threads at once, making them perfect for parallel workloads.

What Are SMs?

Think of SMs as the workhorses inside your NVIDIA GPU. They're pipelined for instructions, much like CPUs have been since the '90s, but without the bells and whistles of speculative execution or predicting where the code jumps next. What they lack in smarts, they make up for in sheer numbers and speed at handling parallelism.

Warps and Schedulers

A warp is basically a squad of 32 threads that all run the same instruction at the same time—it's the GPU's fundamental scheduling unit. Warp schedulers keep things humming by swapping between these warps every clock cycle, ensuring the compute units stay busy and latencies from memory or instructions get hidden away.
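
If you want to see the warp structure from inside a kernel, here's a minimal sketch (the kernel name and launch shape are just for illustration) that derives each thread's warp and lane from its index:

    #include <cstdio>

    __global__ void warp_ids()
    {
        // warpSize is a built-in device constant (32 on current NVIDIA GPUs).
        int warp_id = threadIdx.x / warpSize;  // which warp inside this block
        int lane_id = threadIdx.x % warpSize;  // this thread's slot within that warp

        if (lane_id == 0)  // let one thread speak for each warp
            printf("block %d, warp %d\n", blockIdx.x, warp_id);
    }

    int main()
    {
        warp_ids<<<2, 128>>>();    // 2 blocks of 128 threads = 4 warps per block
        cudaDeviceSynchronize();   // wait so device-side printf output is flushed
        return 0;
    }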

Comparing to CPUs

Let's put this in perspective. An AMD EPYC 9965 CPU tops out at 500W with 192 cores, each juggling up to two threads, for a max of 384 threads in parallel at about 1.3W per thread. Flip to an H100 SXM GPU: 700W, 132 SMs, each with four warp schedulers issuing to 32 threads per cycle. That's 128 threads per SM times 132 SMs, hitting 16,896 parallel threads at roughly 0.04W each. And yeah, it's real parallelism—every thread advances per cycle.

A single SM on the H100 can manage up to 2048 concurrent threads, divided into 64 warps. Across all SMs, that's over 250,000 threads running at once. CPUs handle multiple threads too, but GPU warp switches are lightning-fast—every clock cycle, over 1000 times quicker than CPU context switches. This quick pivoting, courtesy of the SM's warp schedulers, masks delays from memory or sync ops, keeping CUDA and Tensor cores cranking.

For kicks, consider a modest PC CPU at 2.6GHz with 6 cores: that's 15.6 billion clock cycles per second total. GPUs scale that parallelism way up.
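
If you want to check those numbers, the arithmetic is plain division; the constants below are just the wattages and thread counts quoted above:

    #include <cstdio>

    int main()
    {
        // AMD EPYC 9965 figures quoted above.
        const double cpu_watts   = 500.0;
        const double cpu_threads = 192 * 2;        // 192 cores x 2 SMT threads
        // H100 SXM figures quoted above.
        const double gpu_watts   = 700.0;
        const double gpu_threads = 132 * 4 * 32;   // SMs x schedulers x threads per issue

        printf("CPU: %.0f threads, %.2f W per thread\n", cpu_threads, cpu_watts / cpu_threads);
        printf("GPU: %.0f threads, %.3f W per thread\n", gpu_threads, gpu_watts / gpu_threads);
        // Prints roughly 1.30 W per CPU thread and 0.041 W per GPU thread.
        return 0;
    }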

Inside the SM: Units at Work

Beyond CUDA and Tensor cores, SMs pack Special Function Units (SFUs) for tricky math like sine, cosine, reciprocals, and transcendentals. These offload the heavy lifting so main cores don't bog down. Load/Store Units (LSUs) shuffle data between registers, shared memory, and global memory. The smooth interplay between CUDA cores, SFUs, and LSUs lets SMs juggle thousands of threads without grinding to a halt on data moves or complex ops.
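
As a rough sketch of who does what, the kernel below (hypothetical name, host-side setup omitted) leans on the fast single-precision intrinsics that are typically serviced by the SFUs, while the LSUs handle the loads and stores:

    __global__ void sfu_demo(const float* __restrict__ in, float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = in[i];                 // LSU: load from global memory
        // Fast hardware approximations (less precise than sinf/expf),
        // the kind of work that gets routed to the SFUs.
        float y = __sinf(x) + __cosf(x);
        y += __expf(-x) * rsqrtf(x * x + 1.0f);
        out[i] = y;                      // LSU: store back to global memory
    }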

Together, these form the GPU's powerhouse, striking a balance between crunching numbers and moving bits to fuel efficient parallel execution.

Launching a CUDA Kernel

When you kick off a CUDA kernel, it's not like calling a C++ function. You're handing over a blueprint for parallelism, often in PTX intermediate form, to the NVIDIA driver. The driver acts as a just-in-time compiler, turning PTX into SASS—the GPU's native assembly. The launch configuration isn't a hint to a compiler; it configures the GPU hardware directly. The CPU bows out after describing the job.

The GPU's GigaThread Engine, its global scheduler, doesn't peek at your loops or logic. It sees the grid of thread blocks you defined as a work queue and assigns them to available SMs. This load balancing is pure hardware magic, out of your or the compiler's hands.
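
A minimal sketch of that hand-off from the host side, with a made-up kernel and no error checking:

    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    void launch_scale(float* d_data, float factor, int n)
    {
        int block = 256;                      // threads per block
        int grid  = (n + block - 1) / block;  // enough blocks to cover all n elements
        // <<<grid, block>>> is the blueprint: the driver finishes compilation
        // (PTX to SASS if needed) and the GigaThread Engine hands the grid's
        // blocks to whichever SMs have free resources.
        scale<<<grid, block>>>(d_data, factor, n);
    }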

Thread Blocks and Resource Allocation

Once a thread block hits an SM, it's stuck there until done—resources get physically locked in. Take a simple line like float my_val = data[idx];—that's claiming a spot in the SM's 256KB register file. Pile on variables like float r1, r2, r3;, and your register footprint balloons, limiting how many threads fit on the SM.

The compiler tweaks register use, but can't expand the hardware's limits. High register demands mean fewer threads, hurting the GPU's latency-hiding strategy: oversubscription.
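
You can watch this trade-off directly: building with -Xptxas -v makes the compiler print each kernel's register count, and __launch_bounds__ lets you cap the block size it optimizes for. The kernel below is a contrived sketch:

    // nvcc -Xptxas -v pressure.cu   prints "Used N registers" for each kernel.
    __global__ void __launch_bounds__(256)     // promise at most 256 threads per block,
    heavy_kernel(const float* in, float* out)  // letting ptxas trade registers vs. occupancy
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Every live local variable competes for space in the SM's register file.
        float r1 = in[i] * 2.0f;
        float r2 = r1 * r1 + 1.0f;
        float r3 = r2 - r1;
        out[i] = r1 + r2 + r3;
    }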

Latency Hiding and Oversubscription

Against memory delays, GPUs don't predict—they oversubscribe. When a warp hits a slow op like val = global_mem[idx];, a scoreboard (a bitmask tracking dependencies) flags the register as pending. The warp scheduler spots the stall and flips to another ready warp next cycle. No software overhead—just zero-cost hardware switching.

High occupancy from low register use gives the scheduler plenty of warps to pick from, keeping things humming.
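
One way to see whether your register footprint is starving the scheduler is the runtime's occupancy query; here's a minimal sketch with a throwaway kernel (the 64-warp ceiling is the H100 figure mentioned earlier):

    #include <cstdio>

    __global__ void my_kernel(float* data)
    {
        data[threadIdx.x] = 0.0f;  // placeholder body
    }

    int main()
    {
        int block_size = 256;
        int blocks_per_sm = 0;
        // Asks the runtime how many resident blocks of my_kernel fit on one SM,
        // given the kernel's register and shared-memory footprint.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                      block_size, 0 /* dynamic smem */);
        printf("Resident warps per SM: %d of 64\n", blocks_per_sm * block_size / 32);
        return 0;
    }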

Execution Pipelines and ILP

SMs are a bundle of specialized pipelines. A warp's instruction, say c = a + b;, routes to the right unit: INT32 ALUs for integers, FP32 CUDA cores for floats. In one cycle, the SM might handle an int op for warp 7, a load for warp 3, and a multiply for warp 12. That's instruction-level parallelism on top of your thread-level setup. The compiler plans it, hardware runs it.
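
A tiny, contrived sketch of what gives the scheduler that freedom: the integer address math and the floating-point math below don't depend on each other, so they can flow down different pipes:

    __global__ void ilp_demo(const float* in, float* out, const int* idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int   j = idx[i] * 4 + 1;        // integer pipeline (INT32 ALUs)
        float v = in[i] * 0.5f + 2.0f;   // floating-point pipeline (FP32 cores)
        // j and v are independent, so the compiler can interleave their
        // instructions and the SM can keep both kinds of units busy.
        out[i] = v + (float)(j % 7);
    }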

Memory Access Patterns

Here's where your code meets hardware reality. For global_mem[base_index + threadIdx.x], the LSU gathers the warp's 32 addresses and services them with as few 128-byte cache-line transactions as it can. Sequential 4-byte accesses? One or two lines for the whole warp—perfect coalescing, super efficient.

But global_mem[base_index + threadIdx.x * 32]? With 4-byte elements, each lane's address sits 128 bytes past its neighbor's, so every lane lands in its own cache line: up to 32 separate transactions per warp instead of one, slowing everything down. The compiler can't reorder your accesses without breaking program logic, per the "as-if" rule.
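
Side by side, the two patterns look roughly like this (a sketch; the strided kernel assumes the input buffer is large enough for its widest index):

    __global__ void coalesced(const float* __restrict__ in, float* __restrict__ out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Lane 0 reads in[base+0], lane 1 reads in[base+1], ...: the warp's
        // 32 floats sit in a single 128-byte cache line.
        out[i] = in[i];
    }

    __global__ void strided(const float* __restrict__ in, float* __restrict__ out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Each lane reads 32 floats (128 bytes) past its neighbor, so every
        // lane touches a different cache line: up to 32 transactions per warp.
        out[i] = in[i * 32];
    }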

Shared Memory and Bank Conflicts

Shared memory is dedicated SRAM split into 32 banks, with consecutive 4-byte words mapped to consecutive banks (bank = word address % 32). If the threads of a warp hit different banks, like shared_data[threadIdx.x], it's full-speed parallel access. All 32 threads reading the exact same word, say shared_data[10], is fine too: the hardware broadcasts it. The trouble is different addresses in the same bank, like shared_data[threadIdx.x * 32], where every lane maps to bank 0; the crossbar serializes those 32 requests and you eat a 32x slowdown from bank conflicts. Your layout dictates performance; the compiler can't save you.
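
The classic place this bites is a shared-memory tile read by columns, and the classic fix is padding. Here's a sketch of a tile transpose, assuming a square matrix whose width is a multiple of 32 and a 32x32 thread block:

    #define TILE 32

    __global__ void transpose_tile(const float* in, float* out, int width)
    {
        // The +1 pad makes each row 33 words long, so walking down a column hits
        // bank (33*r + c) % 32 = (r + c) % 32: a different bank for every thread.
        // With plain [TILE][TILE], the column read below would hammer one bank.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row write: conflict-free
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];  // column read: saved by the pad
    }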

Control Flow Divergence

Divergence hits hard in the parallel world. For if (threadIdx.x < 16), hardware checks all 32 threads. If split, it serializes: runs the if path with an active mask disabling non-qualifiers, then the else with the mask flipped. You pay the time for both paths sequentially. No CPU-like branch reordering here—hardware just follows orders.
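
In code, the difference between a divergent branch and a warp-uniform one looks like this (contrived sketch):

    __global__ void branches(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent: lanes 0-15 and 16-31 of every warp take different paths,
        // so each warp runs both paths back to back under an active mask.
        if (threadIdx.x % 32 < 16) data[i] *= 2.0f;
        else                       data[i] += 1.0f;

        // Uniform: the condition depends only on the block, so all 32 lanes of
        // a warp agree and only one path is executed.
        if (blockIdx.x % 2 == 0)   data[i] -= 3.0f;
    }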

GPU Execution Flow
Kernel Launch → PTX to SASS (Driver) → GigaThread Engine Queues Blocks → Assign to SMs → Allocate Registers → Warp Scheduling → Pipeline Dispatch (CUDA/SFUs/LSUs) → Memory Access (Coalescing/Banks) → Handle Divergence with Masks.

Compiler vs. Hardware Limits

The NVIDIA toolchain, like any compiler, optimizes within GPU bounds: it unrolls loops, schedules instructions, manages registers. But it's tied to the "as-if" rule and can't invent new algorithms. It won't fix divergent code, swap your AoS layout to SoA for coalescing, or resolve bank conflicts. Big wins come from aligning your intent with hardware—avoid divergent algos, bad data layouts, or fighting the memory controller.
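
For instance, here's the layout change the toolchain won't make for you: moving from array-of-structures to structure-of-arrays so each field is read at consecutive addresses (the Particle types are made up for illustration):

    // Array of structures: reading only .x gives each lane a 12-byte stride,
    // so a warp's loads spread across many cache lines.
    struct ParticleAoS { float x, y, z; };

    __global__ void step_aos(ParticleAoS* p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i].x += 1.0f;
    }

    // Structure of arrays: lane k reads x[base + k], consecutive addresses,
    // so the warp's loads coalesce cleanly.
    struct ParticleSoA { float *x, *y, *z; };

    __global__ void step_soa(ParticleSoA p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] += 1.0f;
    }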

Bottom Line

SMs drive NVIDIA GPUs with massive parallelism via warps, schedulers, and units like SFUs/LSUs. Kernel launches hand off to hardware for block assignment and execution, hiding latency through oversubscription. Key pitfalls: high register footprints, poor coalescing, bank conflicts, divergence. Optimize by matching code to hardware design, not micro-tweaks.

Further Reading

NVIDIA CUDA Docs: CUDA Programming Guide
