The article makes no mention at all of how much VRAM they might have. Other articles state plans for up to 192GB of HBM3e and a power draw of 1,000 watts.
I'm pretty sure the WSE-3 chip from Cerebras is way more powerful. It has 900,000 cores, something like 27 petabytes per second of bandwidth between those cores, and all of the memory is shared with them on a single chip. They are building an 8-exaflop cluster right now.
Why cut up a wafer of chips, package each with HBM, put the packages on boards, connect them to CPUs with a fabric, and then tie them all back together with networking chips and cables? Their three new clusters are the three biggest AI training platforms in the world. Comparing the WSE-3 against Nvidia's H100: "It's got 52 times more cores. It's got 800 times more memory on chip. It's got 7,000 times more memory bandwidth and more than 3,700 times more fabric bandwidth." But they don't sell the chips; they build the clusters and sell the compute power, except for a couple they built in Dubai.
Plus, newer models are moving from Transformers to state-space architectures like Mamba, and those don't need as much memory: instead of a key/value cache that grows with the context, they carry forward a fixed-size state that keeps only the important information (rough sketch below).
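Here's a toy sketch of the memory difference I mean. The sizes and the update rule are made up for illustration, nothing like Mamba's actual selective scan; the point is only that the Transformer-style cache grows with every token while the state-space state stays a fixed size.

    import numpy as np

    d_model, d_state = 64, 16   # made-up sizes, just for illustration

    # Transformer-style decoding: the key/value cache gains one entry per
    # generated token, so memory grows with context length.
    kv_cache = []
    def transformer_step(x, kv_cache):
        k, v = x, x                  # stand-ins for the real K/V projections
        kv_cache.append((k, v))      # cache keeps everything seen so far
        return x                     # (attention over the cache omitted)

    # State-space-style decoding: a fixed-size state is updated in place,
    # so memory stays constant no matter how long the context gets.
    def ssm_step(x, state, A=0.9, B=0.1):
        return A * state + B * np.outer(x, np.ones(d_state))   # toy update

    state = np.zeros((d_model, d_state))
    for t in range(1000):
        x = np.random.randn(d_model)
        transformer_step(x, kv_cache)   # cache now holds t+1 entries
        state = ssm_step(x, state)      # state is still (d_model, d_state)

    print(len(kv_cache))    # 1000 and still growing
    print(state.shape)      # (64, 16), unchanged

The real recurrence is much more involved, but the memory scaling is the part that matters here.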
For training, you need to follow a gradient, and usually the gradient values are small enough that you need the extra precision. Training in 8-bit is fairly new, but Google and now Grok seem to have gotten good results (toy example of the underflow problem below).
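To be concrete about why precision matters, here's a toy numpy example (not any particular framework's recipe): tiny weight updates simply round away in fp16, which is why mixed-precision training usually keeps a higher-precision master copy of the weights, and why going all the way down to 8-bit is tricky.

    import numpy as np

    grad_step = 1e-4                # a typical small update (learning rate * gradient)

    w_fp32 = np.float32(1.0)        # master weight kept in fp32
    w_fp16 = np.float16(1.0)        # same weight kept only in fp16

    for _ in range(1000):
        w_fp32 = np.float32(w_fp32 - grad_step)
        # 1.0 - 1e-4 rounds back to the nearest representable fp16 value,
        # which is 1.0 itself, so the update is lost every single step.
        w_fp16 = w_fp16 - np.float16(grad_step)

    print(w_fp32)   # ~0.9 -- the small updates accumulated
    print(w_fp16)   # 1.0  -- every update rounded away

Mixed-precision setups get around this by accumulating in fp32 (often with loss scaling); presumably the 8-bit recipes do something similar.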