FWIW, you're comparing a training-specialized chip to an inference-specialized chip. It'd be more apples to apples to compare to TPU v4 lite, but I can't find that chip's details anywhere beyond some mentions in the TPU v4 paper: https://arxiv.org/abs/2304.01433
How does a training-specialized chip function? The forward pass is simple, just a dot product machine. But how do you accelerate backprop in hardware? Does it have the vector-Jacobian product lookup logic and tables baked into hardware?
Mostly you need to be able to stash intermediate products computed in the forward phase so that you can access them in the backward phase. This requires more memory, more memory bandwidth, and more transpose support. Training also usually operates at slightly higher precision (bf16 instead of int8, as one example).
I think it's helpful to split the things that go into an ML accelerator into two categories: big-picture architectural choices (memory bandwidth and sizes, support for big operations like transposition, etc.) and fixed-function optimizations. In all of these systems, there's a compiler that's responsible for taking higher-level operations and compiling them down to those low-level ones. That includes the derivatives used in backprop, which just get mapped to the same primitive operations plus a few more. While there are a few more fixed functions you need to add for loss functions and some derivatives, probably the largest difference is that you need to support transpose (and that you need all that extra memory and bandwidth to keep those intermediate products around in order to backprop through them).
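To make the "stash intermediates, then transpose" point concrete, here is a minimal pure-Python sketch of one linear layer, y = x @ w. The helper names (`matmul`, `transpose`, `forward`, `backward`) are made up for illustration and do not describe how any particular accelerator or compiler implements this.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def forward(x, w):
    """Compute y = x @ w and return the input that must be kept for backward."""
    y = matmul(x, w)
    saved_x = x  # the stashed intermediate: extra memory that inference never needs
    return y, saved_x


def backward(grad_y, w, saved_x):
    """Both gradients are matmuls against a transposed operand."""
    grad_x = matmul(grad_y, transpose(w))        # dL/dx = dL/dy @ w^T
    grad_w = matmul(transpose(saved_x), grad_y)  # dL/dw = x^T @ dL/dy
    return grad_x, grad_w


x = [[1.0, 2.0]]                         # 1x2 input
w = [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]   # 2x3 weights
y, saved_x = forward(x, w)
grad_y = [[1.0, 1.0, 1.0]]               # pretend upstream gradient
grad_x, grad_w = backward(grad_y, w, saved_x)
print(y, grad_x, grad_w)
```

Note that the backward pass reuses the same matmul primitive as the forward pass. The new requirements are the transposes and keeping `saved_x` alive until the gradient arrives.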
Look for the section "CHALLENGES AND OPPORTUNITIES OF BUILDING ML HARDWARE"
But then things change more when you want to start supporting embeddings, so Google's TPUs have included a "sparse core" since TPU v2 to handle those separately (the lookup and memory access patterns are drastically different from those of the dense matrix operations used for non-embedding layers): https://arxiv.org/pdf/2304.01433.pdf
This looks like a customized ASIC specializing solely in recommendation systems possibly focused on ads ranking
>We found that GPUs were not always optimal for running Meta’s specific recommendation workloads at the levels of efficiency required at our scale. Our solution to this challenge was to design a family of recommendation-specific Meta Training and Inference Accelerator (MTIA) ASICs.
You must be Gen Z because the prevailing attitude towards video games of all kinds was mostly negative in the 20th century. They even tried to blame Doom for school shootings.
Probably the software needs to be optimized for the hw, and the hw may not be general-purpose enough even if it were offered. People demand Nvidia because CUDA is very well optimized for their GPUs and a lot of AI software uses CUDA.
You come up with a clever ASIC that is better than their current GPU for your workload… and by the time it comes out they’ve released the next year’s chip that just has like 50% more memory bandwidth or something ridiculous like that, and beats you by pure grunt.
“No replacement for displacement” actually seems to be true in compute.
This is a popular myth. Bitcoin ASICs were 'shipping' in 2012/2013.
Some companies definitely played games and mined with the ASICs themselves (and then shipped those used ASICs)... but in general, it was always a lot more profitable to sell the shovels than it was to mine the gold.
These seem power and density optimized. This sort of custom hardware is all about supply chains and getting a lot of them everywhere. This favors the inference use case.
For large training jobs it is more about turnaround time; running hideously expensive GPUs sucking down huge amounts of power is fine.
It looks rather general-purpose (for ML tasks) to me:
>Each PE is equipped with two processor cores (one of them equipped with the vector extension) and a number of fixed-function units that are optimized for performing critical operations, such as matrix multiplication, accumulation, data movement, and nonlinear function calculation. The processor cores are based on the RISC-V open instruction set architecture (ISA) and are heavily customized to perform necessary compute and control tasks.
It is ambiguous on that front. If you designed it in 2020, getting through test runs at TSMC and then to a final production run would take a while. So when they had it deployed at scale at FB is unclear.
>fabricated in TSMC 7nm process and runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision. It has a thermal design power (TDP) of 25 W.
So two generations of process improvement are immediately available.
TOPS might count a more complex operation as its unit, in the same way that "shader passes per second" could mean either a very simple computation or a very complex operation each second.
Have there been any rumors or statements from Facebook about them eventually stepping into selling cloud compute? I'd be surprised if they are investing in building hardware accelerators just for their own services.
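One way to see what a headline TOPS number implies is to divide it by the clock rate. The sketch below uses only the figures quoted above; the convention that one multiply-accumulate (MAC) counts as two operations is an assumption here, not something the thread or the quoted spec states.

```python
# Figures quoted in the thread for MTIA v1.
INT8_OPS_PER_SECOND = 102_400_000_000_000  # 102.4 TOPS
FP16_OPS_PER_SECOND = 51_200_000_000_000   # 51.2 TFLOPS
CLOCK_HZ = 800_000_000                     # 800 MHz

# Assumption (not from the thread): one multiply-accumulate counts as 2 ops.
OPS_PER_MAC = 2

int8_ops_per_cycle = INT8_OPS_PER_SECOND // CLOCK_HZ
fp16_ops_per_cycle = FP16_OPS_PER_SECOND // CLOCK_HZ
int8_macs_per_cycle = int8_ops_per_cycle // OPS_PER_MAC

print("INT8 ops per cycle:", int8_ops_per_cycle)
print("FP16 ops per cycle:", fp16_ops_per_cycle)
print("INT8 MACs per cycle (if 1 MAC = 2 ops):", int8_macs_per_cycle)
```

If the vendor counts a different unit as one "op", the implied hardware width changes accordingly, which is the ambiguity being pointed out.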
Given that these chips seem to be power optimised and Facebook's recently released sensory model, I wouldn't be surprised to see them in their next iteration of VR devices.
The AI inference/training market is so competitive that I doubt enterprise sales is going to be the problem. A company planning on spending $50M training a model is not going to be convinced by some smooth talking sales guy over a golf game. They will look at the actual price/performance.
Just as incredible is the corresponding announcement of their RSC which is purportedly one of the world's most powerful clusters
Amazing times! Private companies now have compute resources previously only showing up in government labs, and in many cases using novel components like MTIA
This feels like the start of a golden age and in a few years we will have incredible results and breakthroughs
MTIA v1's specs: The accelerator is fabricated in TSMC 7nm process and runs at 800 MHz, providing 102.4 TOPS at INT8 precision and 51.2 TFLOPS at FP16 precision. It has a thermal design power (TDP) of 25 W. Up to 128 GB of LPDDR5 RAM.
Google's Cloud TPU v4: 275 teraflops (bf16 or int8), 90/170/192 W. 32 GiB of HBM2 RAM, 1200 GBps. From here: https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...
So it seems that the Google Cloud TPU v4 has an advantage in compute per chip and RAM speed, but the Meta chip is much more power-efficient (2x to 4x, it is hard to tell) and has more RAM, though slower RAM?
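For anyone who wants to check the efficiency guess, here is a rough back-of-the-envelope calculation using only the numbers quoted above. It assumes peak throughput divided by the quoted wattage is a fair proxy for efficiency, which is debatable: a TDP and the three TPU power figures are not necessarily measured the same way, and peak numbers say little about real workloads.

```python
# Numbers as quoted in the thread.
MTIA_INT8_TOPS = 102.4
MTIA_FP16_TFLOPS = 51.2
MTIA_WATTS = 25

TPU_V4_TFLOPS = 275           # quoted for both bf16 and int8
TPU_V4_WATTS = (90, 170, 192)  # the three power figures quoted


def efficiency_ratio(mtia_tput, tpu_tput, tpu_watts):
    """MTIA throughput-per-watt divided by TPU v4 throughput-per-watt."""
    return (mtia_tput / MTIA_WATTS) / (tpu_tput / tpu_watts)


int8_ratios = {w: efficiency_ratio(MTIA_INT8_TOPS, TPU_V4_TFLOPS, w) for w in TPU_V4_WATTS}
fp16_ratios = {w: efficiency_ratio(MTIA_FP16_TFLOPS, TPU_V4_TFLOPS, w) for w in TPU_V4_WATTS}

for watts in TPU_V4_WATTS:
    print(f"TPU at {watts} W: INT8 ratio {int8_ratios[watts]:.2f}x, "
          f"16-bit ratio {fp16_ratios[watts]:.2f}x")
```

On these inputs the INT8 advantage comes out at roughly 1.3x to 2.9x depending on which TPU wattage you pick, and the 16-bit advantage is about half of that, so "hard to tell" is a fair summary.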