Why GPU Matmul Speed Depends on Input Data: A Power-Throttling Story

An engineer benchmarking CUTLASS against cuBLAS on an A100 stumbled onto a puzzling result: the same 8192³ matrix multiplication ran roughly 15% faster when fed zeros than when fed normally distributed random values. Since matmul kernels execute the same instructions in the same order regardless of operand values, input-dependent timing looks like it should be impossible. The culprit turned out to be physics, not software.

The explanation lies in dynamic switching power. Every transistor flip burns a tiny amount of energy, and across billions of transistors on a GPU running at full tilt, random data causes far more state transitions than predictable data. When total power draw hits the 400 W cap, the GPU's power management throttles the clock to compensate, dragging throughput down. Zeros barely flip anything; uniform values flip fewer bits than signed normal values, which also keep toggling accumulator signs. Tests across other input patterns (checkerboards, ternary, single-bit, all-π, all-twos) line up cleanly with how much bit-level churn each pattern produces.
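
To make "bit-level churn" concrete, here is a minimal sketch that counts how many bits differ between consecutive values in a stream of half-precision numbers. It is a crude software proxy, not a measurement of real transistor activity. The float16 encoding, the sample size, and the helper names (float16_bits, mean_toggles) are assumptions made for illustration; the source does not say which data type the benchmark used.

```python
import random
import struct


def float16_bits(x: float) -> int:
    """Return the IEEE 754 binary16 bit pattern of x as an unsigned integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def mean_toggles(values) -> float:
    """Average number of bits that differ between consecutive values."""
    bits = [float16_bits(v) for v in values]
    flips = [bin(a ^ b).count("1") for a, b in zip(bits, bits[1:])]
    return sum(flips) / len(flips)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000              # hypothetical sample size

# A few of the input patterns mentioned above, as plain Python lists.
patterns = {
    "zeros": [0.0] * n,
    "all twos": [2.0] * n,
    "uniform [0, 1)": [rng.random() for _ in range(n)],
    "normal(0, 1)": [rng.gauss(0.0, 1.0) for _ in range(n)],
}

toggles = {name: mean_toggles(vals) for name, vals in patterns.items()}
for name, t in toggles.items():
    print(f"{name:>15}: {t:5.2f} bits flipped per consecutive pair")
```

Constant inputs never flip a bit under this measure, while the signed normal stream flips more bits per step than the uniform one, partly because its sign bit changes about half the time. That ordering matches the ranking described above, though real hardware toggling also depends on the multiplier and accumulator internals, which this sketch does not model.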

Two experiments back this up: lowering the power limit widens the predictable-vs-random gap, while lowering the clock limit closes it, because at the lower clock the chip no longer hits the power ceiling. The practical takeaway: GPU benchmarks using zero-initialized or integer-only inputs systematically overstate real-world matmul throughput, and hardware reviewers and kernel authors need to use realistic data distributions to get honest numbers.
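
The power-limit and clock-limit behaviour can be reproduced with a toy model. The sketch below assumes power is a fixed static term plus a dynamic term proportional to clock speed and to a per-pattern activity factor, and that throughput scales with the sustained clock. Every constant in it (the static power and the two activity factors) is a hypothetical value chosen for illustration, not a measurement, and the model ignores voltage scaling.

```python
STATIC_W = 100.0             # hypothetical static (non-switching) power
PREDICTABLE_W_PER_MHZ = 0.19  # hypothetical activity factor for zeros-like inputs
RANDOM_W_PER_MHZ = 0.25       # hypothetical activity factor for random inputs


def sustained_clock_mhz(activity_w_per_mhz, power_cap_w, clock_limit_mhz):
    """Highest clock that respects both the clock limit and the power cap."""
    power_limited_mhz = (power_cap_w - STATIC_W) / activity_w_per_mhz
    return min(clock_limit_mhz, power_limited_mhz)


def speed_gap(power_cap_w, clock_limit_mhz):
    """Fractional speed advantage of predictable inputs over random inputs."""
    fast = sustained_clock_mhz(PREDICTABLE_W_PER_MHZ, power_cap_w, clock_limit_mhz)
    slow = sustained_clock_mhz(RANDOM_W_PER_MHZ, power_cap_w, clock_limit_mhz)
    return fast / slow - 1.0


baseline = speed_gap(power_cap_w=400.0, clock_limit_mhz=1410.0)
lower_power = speed_gap(power_cap_w=300.0, clock_limit_mhz=1410.0)
lower_clock = speed_gap(power_cap_w=400.0, clock_limit_mhz=1000.0)

print(f"400 W cap, 1410 MHz limit: gap = {baseline:.1%}")
print(f"300 W cap, 1410 MHz limit: gap = {lower_power:.1%}")
print(f"400 W cap, 1000 MHz limit: gap = {lower_clock:.1%}")
```

In this model the gap grows when the power cap drops, because both input patterns become power-limited and their clocks diverge in proportion to their activity factors. It vanishes when the clock limit is low enough that neither pattern reaches the cap.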