CUDA / systems

Profiling a fused attention kernel on H100: bandwidth wins, wall-time losses, and what the profiler said

Apr 28, 2026

This post came out of a CUDA project on H100 where the original question sounded simple: if projection and attention are kept closer together, can we cut enough HBM traffic to make inference meaningfully faster? The answer turned out to be “sometimes, but not for the reason I first wanted.” Lower bandwidth was part of the story. The profiler made it obvious that it was not the whole story.

Setup: H100 SXM5, 132 SMs, 3.35 TB/s HBM bandwidth, and a three-kernel family comparing a fully fused scalar path with two hybrid paths that split projection and attention differently.

The problem setup

Standard transformer inference paths tend to split QKV projection and attention into separate stages. That means writing intermediate tensors to HBM and then pulling them back into the next stage. On H100, that round-trip is expensive enough to be worth challenging, but the more interesting question is what happens after you reduce it. Do you actually get wall-time back, or do you simply reveal a different bottleneck?

What I built

We explored three paths: a fully fused scalar kernel that combined projection and attention without writing Q, K, or V back to HBM, a hybrid path that kept projection on cuBLAS and replaced attention with a custom tiled kernel, and a warp-cooperative bf16 hybrid path that became the strongest custom runtime variant.

My contribution centered on the tiling strategy, shared-memory layout, an HBM traffic model, a NumPy-based oracle for correctness, and WMMA benchmarking to test the Tensor Core direction directly.

The first wrong assumption

The naive mental model was that if the fused path removed enough global-memory traffic, the rest would take care of itself. That part did not hold. The fully fused scalar path did reduce memory movement, but it still lost badly in wall time because the projection loops were running in scalar fp32 and leaving Tensor Cores idle.

What the profiling stack actually captured

The project used CUDA-event timing for the final wall-clock comparisons, NVTX annotations plus Nsight Systems for timeline inspection, and an analytical HBM model that was checked against the committed result tables. The repo also prepared an occupancy sweep and Nsight Compute metric hooks around l1tex__t_bytes and sm__warps_active.avg.pct_of_peak_sustained_active, but the published artifact set did not preserve a complete final dump of measured occupancy or L2 counters. I am keeping that distinction explicit here because the useful profiler evidence from this project was real, but not every planned counter made it into the final committed output.

Why the tile shape mattered

The final fused kernel used a T = 64 tile shape. That gave a workable balance between shared-memory residency and thread-level coordination without pushing the kernel into an even more fragile register or occupancy tradeoff. The shared-memory layout used [T][d+1] padding to avoid bank conflicts while staging Q, K, and V tiles.

The stronger reason for T = 64 was hardware budget, not aesthetics. At d = 64, the four shared-memory arrays sQ, sK, sV, and sO consume about 65 KB per block on H100, which still allows two resident blocks per SM. Going larger pushed the kernel toward a one-block-per-SM regime, and smaller tiles left the kernel under-filled enough that the extra coordination overhead stopped being worth it.

// simplified shared-memory sketch
constexpr int T = 64;
__shared__ float q_tile[T][D + 1];
__shared__ float k_tile[T][D + 1];
__shared__ float v_tile[T][D + 1];

for each tile in sequence:
  load tile into shared memory
  compute local attention scores
  accumulate outputs

// padding on the second dimension helps avoid bank conflicts

Profiler view of the workflow

Kernel family diagram

Standard path projection and attention as separate stages intermediate QKV written to and read from HBM

Fully fused path cuts the QKV round-trip but leaves scalar fp32 projection loops in the hot path

Best custom path projection stays on cuBLAS Tensor Cores custom bf16 attention kernel wins at short sequence lengths

The profiler changed the question from “did fusion help memory?” to “what is left once memory stops being the dominant explanation?”

What the numbers said

Signal	Observed result	What it meant
Correctness	11 / 11 tests passed, max absolute error ≤ 1.5 × 10^-7	The custom kernels were close enough to the PyTorch reference path to make the profiler output trustworthy.
HBM reads	up to 54.6% lower	The original memory-traffic hypothesis was real. The QKV round-trip was expensive enough to matter.
Exact HBM read estimate at N = 1024	11.53 MB baseline vs 5.24 MB fused	The largest sequence length made the memory gap concrete instead of leaving it at a percentage headline.
Best short-sequence speedups	1.22x at `N = 64`, 1.41x at `N = 128`	The hybrid bf16 path turned the memory improvement into real latency wins in the right regime.
Peak allocation	85–91% lower	The best custom path improved memory behavior even when it was not the fastest option at every sequence length.
Main remaining bottleneck	scalar fp32 projection loops	The next optimization step was clearly compute-side, not just more fusion.

Where the crossover happened

Sequence length	PyTorch baseline	Best custom kernel	Outcome
64	107.2 µs	88.0 µs	1.22x faster
128	142.1 µs	100.5 µs	1.41x faster
256	127.2 µs	164.2 µs	custom path loses
512	121.6 µs	293.3 µs	baseline clearly ahead
1024	200.7 µs	556.4 µs	baseline clearly ahead

Why this path instead of just leaning on cuBLAS or Triton

cuBLAS stayed in the loop for the best hybrid path because projection was exactly the part that benefited most from a mature Tensor Core implementation. The goal of this project was not to prove that every stage had to be rewritten from scratch. It was to understand what changed when the attention path was made more explicit and memory movement was reduced.

I also did not frame this as a Triton exercise. Triton would be a reasonable follow-up, but the point here was to understand the shared-memory layout, bank-conflict padding, and WMMA direction at the CUDA level first, because that is where the profiler evidence became easiest to trust.

What did not work

The fully fused scalar path is the part I would not hide. It looked like the most aggressive optimization, and it was not the best runtime result. That was useful. It made the next step obvious: once the HBM cost comes down, a weak compute path becomes easier to see.

In other words, the profiler did not tell me “the idea failed.” It told me the optimization had already moved the project into a different bottleneck class. The moment the QKV round-trip cost fell enough, the scalar fp32 projection loops became impossible to ignore.

What I would do next

The follow-up is straightforward: move the remaining hot projection math onto a better Tensor Core path and treat the current fused scalar kernel as a measurement tool rather than the final answer. The WMMA benchmarking already pointed in that direction.

Transferable principle

If bandwidth improves but runtime does not, that is usually not a failed optimization. It is a profiler telling you the next bottleneck finally has room to show itself.