VkFFT: Vulkan/CUDA/Hip/OpenCL/Level Zero/Metal Fast Fourier Transform Library (github.com/dtolm)
173 points by thunderbong on Aug 2, 2023 | 39 comments



Hello, I am the author of VkFFT, Tolmachev Dmitrii.

I remember VkFFT got a lot of initial traction thanks to Hacker News three years ago. Back then VkFFT was a simple collection of pre-made shaders for powers of two FFTs.

Nowadays it is built on a runtime code-generation and optimization platform that supports all the mentioned backends, implements a wide range of algorithms (some not present in other codes) to cover all system sizes, and can do things no other GPU FFT library can so far, like real-to-real transforms, arbitrary-dimensional transforms, zero-padding, convolutions and more.
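To make concrete what zero-padded FFT convolution means (one of the features above), here is a NumPy sketch of the underlying math, not VkFFT's API:

```python
import numpy as np

# Linear convolution of length-m and length-n signals via FFT needs
# transforms of size >= m + n - 1; the zero-padding is what prevents
# circular wrap-around. VkFFT can fuse this padding into the transform.
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 0.25])
n = len(a) + len(b) - 1                      # 4
conv = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)
assert np.allclose(conv, np.convolve(a, b))  # [0.5, 1.25, 2.0, 0.75]
```

Doing this on the GPU with the padding handled inside the FFT saves a pass over memory compared to padding the input buffer yourself.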

If you have some questions about the library, design choices, functionality or anything else - I will be happy to answer them!


* world if everyone just used vulkan *

* futuristic buildings and flying cars *

that’s pretty cool, the vk prefix for a non-vulkan-only library is kind of confusing tho


It was originally Vulkan only, maybe they should have renamed it


I propose Fastest Fourier Transform in the South by Southwest (FFTSXSW).


The way it works now with run-time codegen it should be Fast Fourier Transform Fixed That For You (FFTFTFY) which also somewhat evokes a butterfly diagram.


Vulkan is a terrible standard. And by extension, DirectX 12 is now too, considering they're nearly identical. The amount of absolutely worthless boilerplate is through the roof.

Here's a compact(!) implementation of "Hello, Triangle!" https://github.com/Planimeter/game-engine-3d/blob/97298715b2...

921 lines of doing nothing.


I had this complaint myself not long ago, but Vulkan isn't targeted at new computer graphics programmers just learning about the graphics pipeline, or vertex and fragment shaders. It's certainly not optimised for the 'hello triangle' use case—your complaint is equivalent to someone saying 'why do I need #include <cstdio>, and a main() just to print "hello world"?' A lot of that boilerplate is write-once, meaning it's first-time setup that will never have to be done again after actually initialising the GPU. You'll find that extending that example to include texture mapping, mipmapping, supersampling, and even GPU compute will be a lot easier than the initial code.

Vulkan is targeted at graphics and game engine developers who want to extract the maximum possible performance from their GPUs and know the limitations of a global-state API like OpenGL. Vulkan allows extremely fine-grained pipeline management and synchronisation primitives, and adds in hardware ray-tracing support natively without having to uncannily bolt it on.

If all you want is to draw a three-coloured triangle, you can do that easier and faster with ShaderToy instead of fudging with Vulkan. If you want to write a fast, powerful, modern graphics engine, then you use Vulkan or D3D12.

People forget that GPUs are now massive slices of silicon with memory and power subsystems in their own right, and are obviously extremely powerful hardware. At some point, OpenGL itself becomes a bottleneck, or is difficult enough to program with that Vulkan becomes easier, and that's when the true utility and power of its extreme verbosity is displayed.

For the record, the boilerplate isn't '921 lines of nothing'—it's effectively setting the GPU up from scratch, similar to bootstrapping a CPU from 16-bit real mode.


Except that with the deprecation/stagnation of OpenGL, Vulkan is the only native 3D API new computer graphics programmers have left to learn from Khronos.

Using middleware instead might be the answer.

Then we don't really need Vulkan, as the middleware already lets us use the best 3D API on each platform.


How is WebGPU in that regard?


I’m not new to this, and Vulkan isn’t targeted at game engine developers.


At least DirectX comes with better tooling and documentation.


I'd like to see this tested on an MI210/MI250 with ROCm 5.6. There are improvements in the latest release that might affect the benchmarks in a positive way.


Not quite what I asked for, but close enough for now...

https://github.com/DTolm/VkFFT/discussions/126


Now we just need VkDNN


To a first approximation, Kompute[1] is that. It doesn't seem to be catching on; I'm seeing more buzz around WebGPU solutions, including wonnx[2] and more hand-rolled approaches, and IREE[3], the last of which has a Vulkan back-end.

[1]: https://kompute.cc/

[2]: https://github.com/webonnx/wonnx

[3]: https://github.com/openxla/iree


Another option is Tinygrad [1], which has a WebGPU backend and works very well for my case (Unet 3D).

[1] - https://tinygrad.org/


You need VkBLAS first.


Rust bindings!!! pleeeeze !

Very impressive performance. I'd be happy to see a comparison with regular CPU performance... If you add up the GPU time plus the GPU upload & download, is it faster than the CPU overall?


Rust bindings are at the bottom of the readme: https://github.com/semio-ai/vkfft-rs

Also Python: https://github.com/vincefn/pyvkfft


gettin' old, it's the second time today I've missed obvious, clearly visible information :-(


> If you put together the GPU time + GPU upload & download, is it faster than CPU overall?

That always depends on sample size and the hardware you use.

And like all problems of this kind, it also depends on how parallelizable the computation you are doing is.


Make AMD a first-class citizen.


AMD needs to make itself a first-class citizen.


Did NVIDIA's ascent to a 1T company result primarily from their substantial software investments, or is there another element that AMD needs to focus on to achieve similar recognition and adoption in the realm of GPU compute?


It's mostly software. Nobody gives a damn about your AI chip if the software isn't there, even if the hardware is good, and AMD's hardware is no slouch.

The other factor is that AMD's data center hardware is not available at any cloud provider, so nobody even has access to the supposedly supported hardware.


NVIDIA let you run CUDA on consumer GPUs; AMD didn't let you run ROCm on consumer GPUs. Big mistake.


How does this compare to something like fftw3 or pffft?


I think FFTW does not run on GPUs.


Nvidia has something they call cuFFTW. It's basically a drop-in replacement for FFTW.

That's the kind of stuff Nvidia has offered for the last decade while AMD did god knows what.

https://docs.nvidia.com/cuda/cufft/index.html


Right, I know, but what's the advantage of GPU vs CPU for FFT, considering CPUs support some vectorization and you need to format the data and send it to the GPU and back?


As far as I understand, that's not a very meaningful question, because it depends on which CPU and which GPU. So it's a bit apples to oranges and depends on the user's configuration. There is a benchmark at the very bottom: https://openbenchmarking.org/test/pts/vkfft

Also, maybe a bit obvious, but even if there is no huge benefit, sending compute to the GPU frees up your CPU/application to do other things, like keeping your application responsive :)
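A back-of-envelope model of the transfer question, with assumed (plausible but made-up) bandwidth numbers: PCIe 4.0 x16 at ~25 GB/s effective, GPU memory at ~1000 GB/s, CPU memory at ~50 GB/s. Since FFT is memory bound, time is roughly a few passes over the data divided by bandwidth:

```python
def fft_time_s(n_bytes, mem_bw, passes=2):
    # Memory-bound estimate: a couple of full passes over the data.
    return passes * n_bytes / mem_bw

def gpu_total_s(n_bytes, pcie_bw=25e9, gpu_bw=1000e9):
    # Upload + FFT on GPU + download.
    return 2 * n_bytes / pcie_bw + fft_time_s(n_bytes, gpu_bw)

def cpu_total_s(n_bytes, cpu_bw=50e9):
    return fft_time_s(n_bytes, cpu_bw)

n = 256 * 2**20  # 256 MiB buffer
# With these numbers the PCIe round-trip alone exceeds the whole CPU
# FFT time, so a single transform with a round-trip favors the CPU.
print(gpu_total_s(n), cpu_total_s(n))
```

Which is exactly why the comments below point out that the win comes when the data already lives on the GPU and stays there.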


The FFT is rarely the only thing you're doing, so at the very least you get to keep the data local if it was already on gpu.


FFT is memory bound (it's N log N flops for N bytes, so little arithmetic per byte moved). GPU HBM is much faster than CPU DRAM, so it's generally much faster on a GPU.
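The "memory bound" claim can be quantified with the standard radix-2 estimate of ~5 N log2(N) flops per complex FFT; this is a rough model, not a measurement:

```python
import math

def intensity(n):
    # Arithmetic intensity (flops per byte) of one FFT pass over
    # n complex64 samples: ~5 n log2(n) flops, with one read and one
    # write of 8 bytes per sample.
    flops = 5 * n * math.log2(n)
    bytes_moved = 2 * 8 * n
    return flops / bytes_moved

print(intensity(2**10), intensity(2**24))  # 3.125 and 7.5 flops/byte
```

A few flops per byte is far below what modern GPUs can sustain arithmetically, so bandwidth, not compute, sets the FFT's speed limit.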


I’d love to see this in torch. What are the odds?


Implementing a custom layer isn't hard[0]. That said, if it were me I'd rather add it in the runtime, e.g. via a TensorRT plugin.

[0] e.g. https://jamesmccaffrey.wordpress.com/2021/09/02/example-of-a...


That tutorial only shows you how to remix existing operators, which obviously won't work here (it's not what's required). You need to do this:

https://pytorch.org/tutorials/advanced/dispatcher.html

More complex but still not that hard.



But not backed by VkFFT. The implication of the comment is that FFTs on various backends would be easier if they were implemented on top of VkFFT in the first place. Not sure that's true, though, as I don't know how much code the various backends share.

Edit: as an example that I experienced first-hand, coremltools, which converts PyTorch to CoreML models, only gained FFT support very recently. It's also not really a PyTorch backend but a PyTorch code converter, though, so wouldn't benefit at all from PyTorch's FFT being backed by VkFFT. Still, good example that one shouldn't take FFTs for granted.


Of course, but cuFFT is half the speed of VkFFT.



