
FFT is only O(N log N) for a vector of length N. With respect to matrices, for an N by N matrix it would be more like O(N^2 log N), since you would perform an FFT for each row or column.
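(Quick numpy sketch, just to show where the N^2 log N comes from: each axis pass is N length-N FFTs, and the full 2-D FFT is simply both passes.)

    import numpy as np

    def fft2_by_rows_then_cols(a):
        # N 1-D FFTs over the rows: N * O(N log N)
        rows = np.fft.fft(a, axis=1)
        # N more 1-D FFTs over the columns: another N * O(N log N)
        return np.fft.fft(rows, axis=0)

    a = np.random.rand(256, 256)
    assert np.allclose(fft2_by_rows_then_cols(a), np.fft.fft2(a))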





Thank you for that catch.

I still think we are comparing ASIC matmul hardware to non-ASIC FFT hardware. The given TPU hardware does 256x256 matrix multiplication in linear time by using a 256x256 multiplier grid. An FFT ASIC could do the same kind of thing, but handle a much larger N before memory becomes the bottleneck.
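(Very rough back-of-envelope, counting multiply ops only and ignoring memory traffic and data layout, which is of course the real point of the hardware argument; the sizes are arbitrary.)

    import math

    for n in (256, 4096, 65536):
        matmul_ops = n ** 3                    # direct N x N matrix multiply
        fft_ops = n ** 2 * math.log2(n)        # ~N^2 log N, per the comment above
        print(f"N={n}: matmul needs ~{matmul_ops / fft_ops:.0f}x more multiplies")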


Part of the FFT can be accelerated on GPU hardware, which is full of butterfly-like instructions within warps. Using overlap-and-add/overlap-and-save with cuFFTDx also lets you exploit heavy parallelism within shared memory. I had a hard time reproducing the tcFFT paper (for lack of time and tensor core skills, I guess), but apparently you can keep your data in Tensor Core registers as well.
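(Not the cuFFTDx/GPU version, just a rough numpy sketch of overlap-add so the blocking idea is concrete; the block size and signal lengths here are made up.)

    import numpy as np

    def overlap_add_conv(x, h, block=4096):
        # Linear convolution of a long signal x with a short filter h,
        # processed one FFT-sized block at a time -- this blocking is what
        # lets each chunk stay resident in fast (shared) memory.
        n_fft = block + len(h) - 1
        H = np.fft.rfft(h, n_fft)
        out = np.zeros(len(x) + len(h) - 1)
        for start in range(0, len(x), block):
            seg = x[start:start + block]
            y = np.fft.irfft(np.fft.rfft(seg, n_fft) * H, n_fft)
            out[start:start + n_fft] += y[:len(out) - start]  # add overlapping tails
        return out

    x = np.random.randn(100_000)
    h = np.random.randn(257)
    assert np.allclose(overlap_add_conv(x, h), np.convolve(x, h))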

The downside of a dedicated ASIC, besides the fixed size of the multipliers (which isn't that big of a deal, because matrix multiplication can be broken down into blocks anyway), is that the precision (16-bit, 8-bit) and data format (floating point vs. integer/fixed) are immutable.
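(Minimal numpy sketch of that blocking, assuming N is a multiple of the tile size; 256 is just standing in for the fixed multiplier-grid size.)

    import numpy as np

    TILE = 256  # stand-in for the ASIC's fixed multiplier-grid size

    def blocked_matmul(a, b, tile=TILE):
        # Break an N x N product into fixed-size tile products; each tile
        # product is what the fixed-size multiplier grid would compute.
        n = a.shape[0]
        c = np.zeros((n, n))
        for i in range(0, n, tile):
            for j in range(0, n, tile):
                for k in range(0, n, tile):
                    c[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
        return c

    a = np.random.rand(512, 512)
    b = np.random.rand(512, 512)
    assert np.allclose(blocked_matmul(a, b), a @ b)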

Anything is "constant" time if you build big enough hardware and the problem size is fixed.


