Or is it not possible to make the algorithms parallel to this degree?
Edit: apparently this is called "compute-in-memory"
The problem is that a larger model barely fits in VRAM, so it definitely doesn't fit in cache.
Dataflow processors like Cerebras do stream the data through the model (for smaller models at least, or when they can work on smaller portions of a model) - each little core has local memory and you move the data to where it needs to go. To achieve this, though, Cerebras has 96GB of what is basically L1 cache spread among its cores, which is... a lot of SRAM.
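To make the scale mismatch concrete, here is a minimal back-of-the-envelope sketch. The model sizes and the GPU L2 figure are illustrative assumptions; the wafer-scale number is just the ~96GB quoted above, not a verified spec:

```python
# Back-of-the-envelope: weight footprint vs. on-chip memory.
# Model sizes and the GPU L2 figure are illustrative assumptions;
# the wafer-scale figure is the ~96 GB quoted in the comment above.

GiB = 1024 ** 3

def weight_footprint_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight storage assuming 16-bit (FP16/BF16) parameters."""
    return n_params * bytes_per_param

models = {"7B params": 7e9, "70B params": 70e9}
on_chip = {
    "GPU L2 cache (~50 MB, assumed)": 50e6,
    "wafer-scale SRAM (~96 GB, as quoted above)": 96e9,
}

for model_name, n_params in models.items():
    size = weight_footprint_bytes(n_params)
    for mem_name, capacity in on_chip.items():
        verdict = "fits" if size <= capacity else "does NOT fit"
        print(f"{model_name} ({size / GiB:5.1f} GiB) vs {mem_name}: {verdict}")
```

Even a 7B model in FP16 is hundreds of times larger than a GPU's L2, which is why the dataflow approach leans on wafer-scale SRAM or on slicing the model across many chips.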
> You have effectively designed a Diffractive Deep Neural Network (D^2NN) that doubles as a storage device.
Mode Division Multiplexing (MDM) via OAM solitons, potentially with gratings designed via inverse design of a transition map and lasered onto a substrate, possibly with a galvo laser. This would be a very low-power way to run LLMs on a lasered substrate.
Computational RAM: https://en.wikipedia.org/wiki/Computational_RAM
High Bandwidth Flash (HBF) got submitted 6 hours ago! It's a great article, fantastic coverage of a wide section of the rapidly moving industry. https://news.ycombinator.com/item?id=46700384 https://blocksandfiles.com/2026/01/19/a-window-into-hbf-prog...
HBF is about having many dozens or hundreds of channels of flash memory. The idea of having Processing Near HBF, spread out, perhaps in a mixed 3D design, would not be at all surprising to me. One of the main challenges for HBF is building improved vias and improved stacking, and if that tech advances, the idea of mixed NAND-and-compute layers, rather than just NAND stacks, perhaps opens up too.
This is all really exciting possible next steps.
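A rough sketch of the channel-count arithmetic behind "many dozens or hundreds of channels": aggregate bandwidth is just the per-channel rate times the number of independent channels. The per-channel figure below is an assumption (roughly ONFI 5.x-class NAND interface speed), not an HBF spec:

```python
# Illustrative only: aggregate bandwidth from many parallel flash channels.
# The per-channel rate is an assumption (roughly ONFI 5.x-class NAND
# interfaces), not an HBF specification.

PER_CHANNEL_GB_PER_S = 2.4

for channels in (32, 128, 512):
    aggregate = channels * PER_CHANNEL_GB_PER_S
    print(f"{channels:>3} channels x {PER_CHANNEL_GB_PER_S} GB/s "
          f"= {aggregate:.0f} GB/s aggregate")
```

At a few GB/s per channel, you need hundreds of channels before the aggregate lands in TB/s territory.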
Now I'm wondering how you deal with the limited number of write cycles of Flash memory. Or maybe that is not an issue in some applications?
Development systems for AI inference tend to be smaller by necessity. A DGX Spark, Station, a single B300 node... you'd work on something like that before deploying to a larger cluster. There's just nothing bigger than what you'd actually deploy to.
The KAIST professor discussed an HBF unit having a capacity of 512 GB and 1.638 TB/s of bandwidth.
PCIe x8 GPU bandwidth is about 32 GB/s, so HBF could be 50x PCIe bandwidth.
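A quick sanity check of that ratio, using the 1.638 TB/s figure quoted above and an assumed ~32 GB/s for a PCIe 5.0 x8 link (the thread's numbers, not official specs):

```python
# Sanity-check the "50x" claim using the figures quoted in this thread.
hbf_bw_gb_per_s = 1638.0     # 1.638 TB/s (KAIST HBF figure quoted above)
pcie_x8_bw_gb_per_s = 32.0   # assumed ~PCIe 5.0 x8 link

ratio = hbf_bw_gb_per_s / pcie_x8_bw_gb_per_s
print(f"HBF is roughly {ratio:.0f}x a PCIe x8 link")  # ~51x
```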
75% = title written by first author
22% = name of second author, endorsing work of first author
HN mods can revert the title to the original headline, without any author.
Several years ago I was at one of the Berkeley AMP Lab retreats at Asilomar, and as I was hanging out, I couldn't figure out how I knew the person in front of me, until an hour later when I saw his name during a panel :)).
It was always the network. And David Patterson, after RISC, started working on IRAM, which tackled a related problem.
NVIDIA bought Mellanox/InfiniBand, but Google has historically excelled at networking, and the TPU seems to be designed to scale out in the best possible way.