CUDA / systems
Profiling a fused attention kernel on H100: bandwidth wins, wall-time losses, and what the profiler said
This post came out of a CUDA project on H100 where the original question sounded simple: if projection and attention are kept closer together, can we cut enough HBM traffic to make inference meaningfully faster? The answer turned out to be “sometimes, but not for the reason I first wanted.” Lower bandwidth was part of the story. The profiler made it obvious that it was not the whole story.
Setup: H100 SXM5, 132 SMs, 3.35 TB/s HBM bandwidth, and a three-kernel family comparing a fully fused scalar path with two hybrid paths that split projection and attention differently.
The problem setup
Standard transformer inference paths tend to split QKV projection and attention into separate stages. That means writing intermediate tensors to HBM and then pulling them back into the next stage. On H100, that round-trip is expensive enough to be worth challenging, but the more interesting question is what happens after you reduce it. Do you actually get wall-time back, or do you simply reveal a different bottleneck?
What I built
We explored three paths: a fully fused scalar kernel that combined projection and attention without writing Q, K, or V back to HBM, a hybrid path that kept projection on cuBLAS and replaced attention with a custom tiled kernel, and a warp-cooperative bf16 hybrid path that became the strongest custom runtime variant.
My contribution centered on the tiling strategy, shared-memory layout, an HBM traffic model, a NumPy-based oracle for correctness, and WMMA benchmarking to test the Tensor Core direction directly.
The first wrong assumption
The naive mental model was that if the fused path removed enough global-memory traffic, the rest would take care of itself. That part did not hold. The fully fused scalar path did reduce memory movement, but it still lost badly in wall time because the projection loops were running in scalar fp32 and leaving Tensor Cores idle.
What the profiling stack actually captured
The project used CUDA-event timing for the final wall-clock comparisons, NVTX annotations plus Nsight
Systems for timeline inspection, and an analytical HBM model that was checked against the committed result
tables. The repo also prepared an occupancy sweep and Nsight Compute metric hooks around
l1tex__t_bytes and
sm__warps_active.avg.pct_of_peak_sustained_active, but the published artifact set did not
preserve a complete final dump of measured occupancy or L2 counters. I am keeping that distinction explicit
here because the useful profiler evidence from this project was real, but not every planned counter made it
into the final committed output.
Why the tile shape mattered
The final fused kernel used a T = 64 tile shape. That gave a workable balance between
shared-memory residency and thread-level coordination without pushing the kernel into an even more fragile
register or occupancy tradeoff. The shared-memory layout used [T][d+1] padding to avoid bank
conflicts while staging Q, K, and V tiles.
The stronger reason for T = 64 was hardware budget, not aesthetics. At
d = 64, the four shared-memory arrays sQ, sK, sV, and
sO consume about 65 KB per block on H100, which still allows two resident blocks per SM.
Going larger pushed the kernel toward a one-block-per-SM regime, and smaller tiles left the kernel
under-filled enough that the extra coordination overhead stopped being worth it.
// simplified shared-memory sketch
constexpr int T = 64;
__shared__ float q_tile[T][D + 1];
__shared__ float k_tile[T][D + 1];
__shared__ float v_tile[T][D + 1];
for each tile in sequence:
load tile into shared memory
compute local attention scores
accumulate outputs
// padding on the second dimension helps avoid bank conflicts
Profiler view of the workflow
Kernel family diagram
The profiler changed the question from “did fusion help memory?” to “what is left once memory stops being the dominant explanation?”
What the numbers said
| Signal | Observed result | What it meant |
|---|---|---|
| Correctness | 11 / 11 tests passed, max absolute error ≤ 1.5 × 10-7 | The custom kernels were close enough to the PyTorch reference path to make the profiler output trustworthy. |
| HBM reads | up to 54.6% lower | The original memory-traffic hypothesis was real. The QKV round-trip was expensive enough to matter. |
| Exact HBM read estimate at N = 1024 | 11.53 MB baseline vs 5.24 MB fused | The largest sequence length made the memory gap concrete instead of leaving it at a percentage headline. |
| Best short-sequence speedups | 1.22x at N = 64, 1.41x at N = 128 |
The hybrid bf16 path turned the memory improvement into real latency wins in the right regime. |
| Peak allocation | 85–91% lower | The best custom path improved memory behavior even when it was not the fastest option at every sequence length. |
| Main remaining bottleneck | scalar fp32 projection loops | The next optimization step was clearly compute-side, not just more fusion. |
Where the crossover happened
| Sequence length | PyTorch baseline | Best custom kernel | Outcome |
|---|---|---|---|
| 64 | 107.2 µs | 88.0 µs | 1.22x faster |
| 128 | 142.1 µs | 100.5 µs | 1.41x faster |
| 256 | 127.2 µs | 164.2 µs | custom path loses |
| 512 | 121.6 µs | 293.3 µs | baseline clearly ahead |
| 1024 | 200.7 µs | 556.4 µs | baseline clearly ahead |
Why this path instead of just leaning on cuBLAS or Triton
cuBLAS stayed in the loop for the best hybrid path because projection was exactly the part that benefited most from a mature Tensor Core implementation. The goal of this project was not to prove that every stage had to be rewritten from scratch. It was to understand what changed when the attention path was made more explicit and memory movement was reduced.
I also did not frame this as a Triton exercise. Triton would be a reasonable follow-up, but the point here was to understand the shared-memory layout, bank-conflict padding, and WMMA direction at the CUDA level first, because that is where the profiler evidence became easiest to trust.
What did not work
The fully fused scalar path is the part I would not hide. It looked like the most aggressive optimization, and it was not the best runtime result. That was useful. It made the next step obvious: once the HBM cost comes down, a weak compute path becomes easier to see.
In other words, the profiler did not tell me “the idea failed.” It told me the optimization had already moved the project into a different bottleneck class. The moment the QKV round-trip cost fell enough, the scalar fp32 projection loops became impossible to ignore.
What I would do next
The follow-up is straightforward: move the remaining hot projection math onto a better Tensor Core path and treat the current fused scalar kernel as a measurement tool rather than the final answer. The WMMA benchmarking already pointed in that direction.
Transferable principle
If bandwidth improves but runtime does not, that is usually not a failed optimization. It is a profiler telling you the next bottleneck finally has room to show itself.