Kioxia & Nvidia are already talking about 100M IOPS SSDs directly attached to GPUs. This is less about running the model & more about offloading context for future use: Nvidia is pushing KV cache out to SSD, and using BlueField-4, which has PCIe on it, to attach SSDs and do processing there. https://blocksandfiles.com/2025/09/15/kioxia-100-million-iop... https://blocksandfiles.com/2026/01/06/nvidia-standardizes-gp... https://developer.nvidia.com/blog/introducing-nvidia-bluefie...
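To make the KV-cache-to-SSD idea concrete, here's a toy sketch in Python: persist a session's per-layer key/value tensors to an NVMe-backed path so a long prefill can be resumed later instead of recomputed. The path, the HF-style past_key_values layout, and everything else here are my assumptions, not Nvidia's actual stack.

    import os
    import torch

    CACHE_DIR = "/mnt/nvme/kv_cache"  # hypothetical NVMe mount

    def save_kv(session_id: str, past_key_values) -> None:
        # past_key_values: per-layer (key, value) tensor pairs, HF-transformers style
        os.makedirs(CACHE_DIR, exist_ok=True)
        cpu_kv = [(k.cpu(), v.cpu()) for k, v in past_key_values]
        torch.save(cpu_kv, os.path.join(CACHE_DIR, f"{session_id}.pt"))

    def load_kv(session_id: str, device: str = "cuda"):
        # Pull a previously spilled context back onto the GPU to resume generation.
        cpu_kv = torch.load(os.path.join(CACHE_DIR, f"{session_id}.pt"), map_location="cpu")
        return tuple((k.to(device), v.to(device)) for k, v in cpu_kv)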
We've already got DeepSeek running straight off NVMe, with the weights living there. Slowly, but this could maybe scale. https://www.reddit.com/r/LocalLLaMA/comments/1idseqb/deepsee...
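The trick that makes weights-on-NVMe workable at all is memory-mapping: pages only get read off the SSD when a layer is actually touched. A rough sketch (file name, dtype, and shapes are invented; llama.cpp does the real thing with mmap'd GGUF files):

    import numpy as np

    # Map a ~10 GB weight shard without loading it into RAM; reads happen lazily.
    weights = np.memmap("/mnt/nvme/model/shard_00.bin", dtype=np.float16,
                        mode="r", shape=(8192, 655360))

    # Only the rows we actually index get paged in from the SSD.
    x = np.ones((655360,), dtype=np.float16)
    out = np.asarray(weights[:1024]) @ x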
Kioxia, for example, has AiSAQ, which works in a couple of places such as Milvus; it's not 100% clear to me exactly what's going on there, but it's trying to push work to the NVMe. And with NVMe 2.1 having computational storage, I expect we'll see more work pushed to the SSD.
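My rough understanding of the DiskANN/AiSAQ-style approach, as a toy sketch (all names and shapes are made up, and the downsampling is just a stand-in for real PQ codes): keep only a small compressed representation in RAM, leave the full-precision vectors on the SSD, and only touch disk to re-rank a few hundred candidates.

    import numpy as np

    DIM, N = 768, 1_000_000
    # Full-precision vectors live in a memory-mapped file on an NVMe mount.
    full = np.memmap("/mnt/nvme/vectors.f32", dtype=np.float32, mode="r", shape=(N, DIM))
    # Tiny in-RAM sketch of each vector (a crude stand-in for PQ codes).
    coarse = np.ascontiguousarray(full[:, ::8])  # (N, 96), ~384 MB in RAM

    def search(query: np.ndarray, k: int = 10, candidates: int = 200):
        scores = coarse @ query[::8]                       # cheap pass, RAM only
        cand = np.argpartition(-scores, candidates)[:candidates]
        exact = full[cand] @ query                         # reads ~200 rows from SSD
        return cand[np.argsort(-exact)[:k]]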
These aren't directly the same thing as HBF. A lot of it is caching, but I also tend to think there's an aspiration to move some work out of RAM entirely, not merely to be able to load into RAM faster.
The best part is the wafers are being bought with no plans to use them; just to keep them in storage so that competitors cannot easily access RAM. Supervillain shit; it should have been the last straw for PG to publicly denounce Sam, and for OpenAI to be sued by the US government for anticompetitive practices. All this does is harm the consumer. Of course that's never going to happen.
I guess the memory companies loved the deal because they knew prices would skyrocket. I wonder if there are rules against cornering the market that apply here.
All in all, it stinks.
But they do, because they prefer holding commodities to US dollars.
> The KAIST professor discussed an HBF unit having a capacity of 512 GB and a 1.638 TBps bandwidth.
One weird thing about this would be that it's still NAND flash, and NAND flash still has limited program/erase cycles, often only a few thousand per cell (drive endurance is typically quoted as Drive Writes Per Day over a 5-year warranty). If you can load a model & just keep querying it, that's not a problem. Maybe the write volume is small enough not to be so bad, but my gut is that writing context here too might present difficulty.
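Back-of-the-envelope, with every number assumed for illustration (real HBF endurance specs aren't public, and this ignores write amplification): a 512 GB unit rated for ~3,000 P/E cycles has a lifetime write budget of roughly 1.5 PB, which a sustained KV-cache spill of 50 GB/s would chew through in a matter of hours.

    capacity_gb = 512
    pe_cycles = 3000                      # assumed TLC-class endurance
    budget_gb = capacity_gb * pe_cycles   # ~1.5M GB (~1.5 PB) of total writes
    spill_gbps = 50                       # assumed sustained KV-cache write rate, GB/s
    hours = budget_gb / spill_gbps / 3600
    print(f"write budget ~{budget_gb/1e6:.1f} PB, exhausted in ~{hours:.0f} hours")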