I think about this regularly when I compile C++ or Rust with LLVM. It's an excellent compiler backend and it produces really good code. But it is incredibly slow, and for no good technical reason: plenty of comparable compilers run circles around it.
Imagine an LLVM rewrite by the people who made V8, or Chrome, or Unreal Engine. Or the author of LuaJIT, or the Go compiler team. I'd be shocked if we didn't see an order-of-magnitude speedup overnight. They'd need some leeway to redesign LLVM IR, of course, and it would take years to port all of LLVM's existing optimisations. But my computer can retire billions of operations per second and render Cyberpunk at 60fps. It shouldn't take seconds of CPU time to compile a small program.
The way I see it, mathematicians have been proving asymptotically faster ways to multiply matrices since the 1970s, with a new bound landing roughly every five years. But those results are faster only in theory.
If you want to implement the theory, you suddenly have many variables to take care of: memory speed, CPU instructions, bit precision, and so on. So in practice, an actual implementation of the theory likely has more room to improve than the asymptotics suggest; a small sketch of how much those variables matter follows below. It is also plausible that LLMs can help figure out how to write a more optimal implementation.
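As a minimal illustration (not anyone's production code; the square row-major float matrices and the function names are my assumptions), here are two C kernels doing the exact same O(n³) arithmetic. Reordering the loops so the inner accesses stream through memory sequentially typically wins by a large constant factor on real hardware, which no complexity analysis captures:

```c
#include <stddef.h>

/* Naive (i,j,k) loop order: the inner loop walks b column-wise,
   so every iteration touches a different cache line. */
void matmul_ijk(size_t n, const float *a, const float *b, float *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* Same arithmetic, (i,k,j) loop order: the inner loop now streams
   through b and c row-wise, which caches and auto-vectorizers
   handle far better. Same O(n^3), very different wall clock. */
void matmul_ikj(size_t n, const float *a, const float *b, float *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            c[i*n + j] = 0.0f;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            float aik = a[i*n + k];
            for (size_t j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];
        }
}
```

The gap between these two functions is exactly the kind of thing the theoretical results don't model.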
Lol, this could well end up much slower than using the general matmul kernel.
However, I like this kind of research because it exploits specific hardware configurations and makes things measurably faster (unlike some theoretical matmul improvements). Code specialization is cheap, and if it saves on the order of a few percent, it quickly pays for itself, especially for something as hot as matmul. A toy version of the idea is sketched below.
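To make "specialization" concrete, here is a toy C sketch (the 8×8×8 shape, the names, and the dispatch are purely illustrative assumptions, not any real library's kernels). The specialized kernel gives the compiler compile-time bounds it can fully unroll and vectorize, and the dispatch branch is the entire price being reimbursed:

```c
#include <stddef.h>

/* General kernel: all three dimensions are run-time values, so the
   compiler emits generic loop code and can't unroll much. */
void matmul_general(size_t m, size_t n, size_t k,
                    const float *a, const float *b, float *c) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++)
                sum += a[i*k + p] * b[p*n + j];
            c[i*n + j] = sum;
        }
}

/* Specialized kernel for one hot shape (8x8x8 is an arbitrary
   example): with every bound a compile-time constant, the compiler
   can fully unroll, vectorize, and constant-fold the indexing. */
void matmul_8x8x8(const float *a, const float *b, float *c) {
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float sum = 0.0f;
            for (int p = 0; p < 8; p++)
                sum += a[i*8 + p] * b[p*8 + j];
            c[i*8 + j] = sum;
        }
}

/* Dispatch: pay one branch to route a known-hot shape to the
   specialized path; everything else falls back to the general code. */
void matmul(size_t m, size_t n, size_t k,
            const float *a, const float *b, float *c) {
    if (m == 8 && n == 8 && k == 8)
        matmul_8x8x8(a, b, c);
    else
        matmul_general(m, n, k, a, b, c);
}
```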