And it always gets prematurely optimized to fit a specific problem instead of being made a general-purpose compute engine.
FPGAs are essentially horrible systolic arrays. They're lumpy, and have weird routing hardware that isn't easy to abstract out. That leads to multi-day compile times in some cases.
They don't pipeline things by default. The programming languages used are nowhere near a good fit to the hardware.
Systolic arrays are only used for fixed-function computing within the context of individual instructions, e.g. a matrix multiplication or convolution unit. In practice, however, most of these arrays are wave front processors, because programmable processing elements can take variable amounts of time per stage and therefore become asynchronous.
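For anyone unfamiliar with the fixed-function case, here is a toy cycle-by-cycle simulation of an output-stationary systolic array doing a matrix multiply. It is a sketch under assumptions, not a model of any real product: the function name `systolic_matmul`, the skewed input schedule, and the register layout are all made up for illustration.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing A @ B.

    PE (i, j) keeps a running sum for C[i][j]. Rows of A stream in from the
    left, columns of B stream in from the top, and every PE forwards what it
    received to its right and lower neighbours one cycle later.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]      # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]    # value each PE forwards to the right
    b_reg = [[0] * N for _ in range(M)]    # value each PE forwards downward
    cycles = M + N + K - 2                 # last operands meet at t = M+N+K-3

    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    # Left edge: row i is skewed by i cycles.
                    k = t - i
                    a_in = A[i][k] if 0 <= k < K else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    # Top edge: column j is skewed by j cycles.
                    k = t - j
                    b_in = B[k][j] if 0 <= k < K else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in   # the only operation a PE performs
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b        # all PEs step in lockstep
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The lockstep register update at the end of each cycle is what makes this systolic. If each PE could take a variable number of cycles, you would need handshaking between neighbours instead, which is the wave front case described above.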
AMD's XDNA NPU is based around a wave front array of 32 compute tiles, each of which can either perform a 4x8 X 8x4 matrix multiplication in float16 or a 1024 bit vector operation per cycle.
It's funny how everyone in this thread is wrong on the high-level concepts (blind leading the blind), but here you're wrong on the specifics too.
> around a wave front array of 32 compute tiles, each of which can either perform a 4x8 X 8x4 matrix multiplication in float16 or a 1024 bit vector operation per cycle.
1. XDNA is exactly the opposite of a "wavefront" processor. Each compute core is a single-threaded vector VLIW processor. So you have X number of independent fixed-function operators.
2. There is no XDNA product with 32 cores and 4x8x4 matmul. Phoenix has 20 compute cores (16 usable) and performs 4x8x4. Strix has 32 and performs 16x32x8 matmul.
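To make the shape notation concrete: "4x8 X 8x4" and "4x8x4" describe the same thing, an (M x K) by (K x N) product written as MxKxN, which needs M*K*N multiply-accumulates. The snippet below just does that arithmetic for the two shapes quoted in this thread. The helper name `macs` is hypothetical, and the shapes are taken from the comments as written, not checked against AMD documentation.

```python
def macs(m, k, n):
    """Multiply-accumulates needed for an (m x k) @ (k x n) matrix product."""
    return m * k * n


# Shapes as quoted in the thread (M, K, N); not independently verified.
shapes = {"4x8x4": (4, 8, 4), "16x32x8": (16, 32, 8)}
mac_counts = {name: macs(*shape) for name, shape in shapes.items()}

for name, count in mac_counts.items():
    print(f"{name}: {count} MACs per tile")
```

This only counts work per tile. It says nothing about how many cycles a given core takes to finish one.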
https://en.wikipedia.org/wiki/Systolic_array