You've decided to write fast code for an NVIDIA Hopper GPU. Maybe you want to build a custom attention kernel. Maybe you're trying to understand how CUTLASS and ThunderKittens work under the hood.

Either way, before you can use any of the cool Hopper hardware — TMA, wgmma, mbarriers, clusters — you need to understand one thing: how memory behaves when thousands of threads share it. That's what the memory model describes.