VkFFT – Vulkan Fast Fourier Transform library (github.com/dtolm)
123 points by DTolm on Nov 20, 2020 | 50 comments



Hello! Since the last post VkFFT has experienced a number of huge improvements and optimizations. Namely:

-It now supports sequences up to 2^32 in all dimensions (algorithmically; in practice this is limited by the allocatable memory size - a switch to a 64-bit addressing scheme is planned for a future release)

-configurations optimized for a bigger range of systems and vendors

-benchmarked the Radeon VII and RTX 3080, showing that FFT is extremely bandwidth-limited on modern GPUs

-VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

-added double and half precision support and precision tests against FFTW on CPU

-improved native zero-padding - up to a 3x performance boost

-switched license to MPL 2.0

Thanks for your attention! I am happy to answer any questions.


> VkFFT is able to match and outperform cuFFT on the whole tested range from 2^7 to 2^28 in single precision

What is your explanation for this?

Is the VkFFT algorithm better? Is SPIR-V fundamentally more expressive than PTX? Are Nvidia drivers better at compiling SPIR-V than PTX?

Have you compared the generated GPU assembly from both?


FFT is an extremely bandwidth-limited problem, so if most of the time is taken by the same memory transfers in both implementations, the overall times will be similar. A more in-depth analysis of how VkFFT and cuFFT scale with memory clocks and bandwidth can be found here: https://www.reddit.com/r/nvidia/comments/jxlbjs/rtx_3090_ove...

I don't know exactly what cuFFT does differently, but I am fairly certain they use a very similar memory layout and algorithms behind their code (judging by execution times alone).

The main takeaway from this should be that Vulkan allows for similarly performant low-level memory control while being cross-platform and open source. I don't think that SPIR-V is more expressive - I bet Nvidia wouldn't allow that. But that doesn't stop it from still being good.


Do I understand your benchmark plots correctly?

Using single precision at a 1k FFT size as my example.

~165,000 kB/ms performance

Converts to 165,000 MB/s performance

Divide by 8 to convert to complex samples, so 20,625 M complex samples per second.

Divide by 1k to get FFT count of ~20.14M FFT/IFFTs per second?

These benchmarks also include transfer time to and from the GPU?


A 1k FFT in single precision is 1024 x 2 x sizeof(float) = 8KB. If we ignore the fact that such a small transform won't utilize the full GPU (not even one compute unit) and assume it scales similarly to big systems, then: 1) 165GB/s is the algorithmic bandwidth of the benchmark, which includes a consecutive FFT+iFFT. Each of those takes one upload to and one download from the chip - four memory transfers in total - so the real bandwidth for this value is 4*165 = 660GB/s. 2) One FFT is two transfers - upload and download - 16KB in total. 3) 660GB/s / 16KB ≈ 40M iterations per second. Similar to your number, but your number didn't account for the benchmark doing four transfers instead of two.
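To make the arithmetic easy to redo, here is a minimal sketch of the calculation above (the 1024-point size and the four-transfer FFT+iFFT assumption are taken from this thread; none of this is VkFFT code):

    #include <stdio.h>

    int main(void) {
        double fft_bytes     = 1024.0 * 2.0 * sizeof(float); /* 8 KB per 1k complex FFT */
        double algo_bw       = 165e9;                        /* ~165 GB/s read off the plot */
        double real_bw       = 4.0 * algo_bw;                /* FFT+iFFT = 4 memory transfers */
        double bytes_per_fft = 2.0 * fft_bytes;              /* one FFT = upload + download */

        printf("real bandwidth: %.0f GB/s\n", real_bw / 1e9);               /* 660 GB/s */
        printf("FFTs per second: %.1f M\n", real_bw / bytes_per_fft / 1e6); /* ~40 M */
        return 0;
    }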

These benchmarks don't include transfers to and from the GPU, as those are done at PCI-E bandwidth (30GB/s), which is really slow compared to the VRAM-chip bandwidth (>500GB/s). This is why it is important to have enough VRAM and to avoid CPU communication as much as possible.


My son hit that PCIe limit going the other direction, trying to get FFTW data onto the screen. The project:

https://github.com/kevinacahalan/piano_waterfall

With his motherboard it was impossible to keep both FFT views scrolling at full speed if they were large. He ended up creating a circular buffer in video memory so that he could reduce the PCIe traffic to just the fresh new edge of the data. The fix doesn't work everywhere; virtualization seems to break it, including on a Chromebook.

Is VkFFT a reasonable tool for attacking this problem? How difficult might it be to get the FFT result into the needed color component, gamma corrected and scaled, with all the other components?


In VkFFT, the FFT is done as part of the so-called compute queue. There is no need to transfer data over PCI-E to show it on screen - there is a graphics queue that has this as its main purpose. You can use the compute queue results in it and do all the necessary calculations directly on the GPU. This type of task can benefit greatly from doing the FFT on the GPU. VkFFT can append the FFT to the user's command buffer (list of commands), which can then be followed by additional user commands, like scaling, correcting, etc.
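Very roughly, the recording pattern looks like the sketch below. It assumes a VkFFT plan and a user post-processing compute pipeline already exist; the VkFFTAppend call is an assumption about the API shape, so check vkFFT.h for the exact signature in your version:

    #include <vulkan/vulkan.h>
    #include "vkFFT.h"

    /* Sketch only: record the FFT plus a post-processing dispatch into one
       user-owned command buffer, so nothing has to leave the GPU. */
    void record_fft_and_postprocess(VkCommandBuffer cmd,
                                    VkFFTApplication* app,
                                    VkPipeline postPipeline,
                                    VkPipelineLayout postLayout,
                                    VkDescriptorSet postSet,
                                    uint32_t groups)
    {
        /* 1) Append the FFT dispatches to the command buffer. */
        VkFFTAppend(app, cmd);

        /* 2) Make the FFT output visible to the next compute dispatch. */
        VkMemoryBarrier barrier = {
            .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
            .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
            .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        };
        vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, 0,
                             1, &barrier, 0, NULL, 0, NULL);

        /* 3) Scale / gamma-correct / colorize the spectrum in a user shader,
              still on the GPU, before the graphics queue ever samples it. */
        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, postPipeline);
        vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, postLayout,
                                0, 1, &postSet, 0, NULL);
        vkCmdDispatch(cmd, groups, 1, 1);
    }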

AFAIK, virtualization software right now has no or very limited access to the GPU, so no extensive graphics can be done through it.


Thanks for the reply! I apologize - I could have looked up the PCI-E bandwidth and answered that part of my question myself; it just didn't occur to me.

Do you know of an application (other than 2D/3D transforms) where someone would move data from CPU to VRAM and then perform a whole bunch of 1D FFTs on that memory? If any given complex value is part of only 1-2 FFTs, then PCI-E bandwidth is the limiting factor.

Just as a curiosity, why don't FFT benchmarks (CPU/GPU/otherwise) ever simply plot FFTs per second on the y-axis? That's the number we all care about, right? It's always bandwidth (with some nuance, as you described) or, worse, MFLOPs with some arbitrary scaling factor. Fine for relative performance, but if you're not comparing it to my platform, it's not that useful unless I measure my platform and convert to the same representation.


Big 1D FFTs also take a lot of memory by themselves (e.g. a 2^28 transform takes 2GB just to store the complex data: 2^28 x 8 bytes). Multiple smaller batches can be used in ML applications, for example for big kernel convolutions, and all the learning can be done without transferring data to the CPU. In computational physics, iterative processes or PDE integrators can run their algorithms independently of the CPU.

About using simple time per iteration as a benchmark: I used to have that type of benchmark before the last update (see: https://raw.githubusercontent.com/DTolm/VkFFT/f7c8c45717006c...). As you can see, it is not really that informative, as you can't really compare the small times to the big ones. The layout I use now doesn't make any assumptions about the algorithm yet still provides very informative scaling - just by looking at it, it is possible to judge whether techniques like register overutilization actually work. Another important thing is that it clearly shows that the problem is bandwidth-bound and makes it easy to compare at which size VkFFT/cuFFT switch from a 1- to a 2- and then to a 3-stage FFT algorithm. It also allows detecting whether the algorithm deviates from the predicted result.


> -benchmarked the Radeon VII and RTX 3080, showing that FFT is extremely bandwidth-limited on modern GPUs

Great to see that!

I expect huge improvements in that area with AMD's new RX series with SAM activated [0].

[0]: https://www.amd.com/en/technologies/smart-access-memory


Actually, it is still best to aim for zero transfers between the GPU and CPU during execution. The GPU is limited by the VRAM-chip bandwidth, which is much bigger than the PCI-E bandwidth, and that should not be affected by SAM.


Any plans for arbitrary-size transforms? (i.e., not restricted to vectors whose dimension is a power of two)


Yes, this is indeed something I would like to add in the future. While adding radix kernel support for small prime factors is not that hard, writing an efficient scheduler is a much more challenging task (each sequence, even for powers of two, is currently split differently for different target architectures to optimize performance).

Bluestein's algorithm, typically used for arbitrary prime sizes, requires both zero-padding and convolution support, which are already efficiently implemented, so it is also not completely out of reach.
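For context on why those two pieces are exactly what is needed: Bluestein's algorithm rewrites the DFT exponent using nk = (n^2 + k^2 - (k-n)^2)/2, which turns an arbitrary-length DFT into a chirp multiplication followed by a convolution:

    X_k = e^{-i\pi k^2/N} \sum_{n=0}^{N-1} \left( x_n e^{-i\pi n^2/N} \right) e^{i\pi (k-n)^2/N}

The convolution only has to be correct for 0 <= k < N, so it can be zero-padded to any convenient FFT length >= 2N-1 (e.g. the next power of two) and evaluated with the fast kernels that already exist.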


why did you choose MPL 2.0?


It is a great open-source license for library projects. For example, Eigen uses it: https://eigen.tuxfamily.org/index.php?title=News:Relicensing...!


Can someone ELI5 what this library is useful for?


So far there have been two ways to do heavy compute tasks on GPUs: CUDA (Nvidia only) and OpenCL (all vendors). Nvidia invested a lot in software and toolchains to make CUDA the go-to option for many projects (especially in the machine learning community). Meanwhile OpenCL is falling apart and sees less and less support and updates.

However, the Vulkan API, which is also supported by most vendors (except Apple, where you have to use a compatibility layer called MoltenVK), is gaining traction in the compute sector. If you trust the benchmarks, then this library shows that you can get similar performance out of Vulkan compute to what you would expect from CUDA. It is just that this library only provides a very small fraction of the features of the CUDA ecosystem, so Vulkan compute still has a lot of catching up to do.

Edit: In case it is not obvious from the title, the library computes the fast Fourier transform: https://en.wikipedia.org/wiki/Fast_Fourier_transform


> Meanwhile OpenCL is falling apart and sees less and less support and updates.

I think this view is too pessimistic. In fact, support is either getting better (Intel oneAPI, Microsoft CLonD3D12, AMD ROCm, Mesa NIR-clover, …) or unchanged but still maintained (NVIDIA). Moreover, Khronos noticed that OpenCL 2.x was a dead end and decided to start over from a point that all vendors could agree on.


> Moreover, Khronos noticed that OpenCL 2.x was a dead end and decided to start over from a point that all vendors could agree on.

I don't know much about OpenCL and this statement confused me. Are you saying that OpenCL 2.x is dead, which seems to contradict what you just said? Unless there's a newer version that isn't dead - is there an OpenCL 3.x? Or is Khronos a code name for OpenCL 3?


There is an OpenCL 3.0. The changes were basically "make everything in OpenCL 2.0 that was not widely implemented optional."

> Looking to reset the ecosystem, as the group likes to call it, today Khronos is revealing OpenCL 3.0, the latest version of their compute API. Taking some hard earned (and hard learned) lessons to heart, the group is turning back the clock on OpenCL, reverting the core API to a fork of OpenCL 1.2.

> As a result, everything developed as part of OpenCL 2.x has now become optional: vendors can (and generally will) continue to support those features, but those features are no longer required for compliance with the core specification.

https://www.anandtech.com/show/15746/opencl-30-announced-hit...



OK, I see. So they were referring to OpenCL 3.x, and Khronos is the organisation that created that standard.


The fact that oneAPI depends on OpenCL on some platforms is mostly an implementation detail, though, and isn't generally visible to applications, which are written in SYCL/DPC++. The current efforts to implement oneAPI for AMD and Nvidia hardware don't use OpenCL (hipSYCL and sycl-cuda). IMO, the fact that some people think oneAPI is the future is an indication that OpenCL is "falling apart," not evidence that it is improving.

Vendors which have had decent OpenCL implementations for a long time (i.e. Intel) will implement oneAPI on top of OpenCL, others will not (if they implement oneAPI at all).


oneAPI is not being run on top of OpenCL; it's being run on top of Level Zero. Although I think the CPU runtime is still using the OpenCL driver for the moment.


That depends on whether you use the OpenCL SYCL backend or the Level Zero SYCL backend. The latter has only existed for a few months.

But yeah, I think the default backend switched to Level Zero recently, so strike through the last sentence of my comment - apparently nobody will focus on OpenCL as a oneAPI backend.


I'm fascinated, and at the same time slightly troubled, by your usage of the word "compute".


I'm sure he is referring to "Vulkan compute" as analogous to OpenGL compute [shaders].


Why?

Parts of the industry focused on computing data as opposed to storing it, or retrieving it. Do you think there aren’t people who do this or do you use another word?


It's a weird word. Grammatically, it is a verb, but people use it as a noun (and sometimes even as an adjective). This usage seems relatively modern; the same concept went by a different name just a few years ago.


It's an annoying but very common habit, especially but not exclusively in US English.

You might be more familiar with the habit of referring to "an invite" instead of "an invitation" or "a quote" instead of "a quotation", which fit the same pattern of "compute" instead of "computation". There's even "the big reveal" instead of "the big revelation" - using "revelation" would likely confuse many people, who would assume it was exclusively religious.


I'm more familiar with the practice of verbing nouns and adjectives, which is very common (a famous Calvin&Hobbes strip ends with the punchline "verbing weirds language"). The opposite practice of nouning verbs has a very different feeling; I'd say it's almost pedantic, or very dry; not as lively as verbing, for sure.


I think it’s just short for computation rather than being ‘nouning’.


I don't write C++, but isn't the code extremely messy? Also, it appears to be C++ and not C like the readme says.


The library only includes the vkFFT.h file (in C) and a set of shaders (a C-like language compiled to SPIR-V). Vulkan_FFT.cpp is only an example that shows how VkFFT can be used. It also contains the benchmark, but that is not part of the library.


Aah, okay, I was a little confused about Vulkan_FFT.cpp. It seemed a little weird to have everything in the .h file, and not just the functions you want to expose in the library.

Again, I know no C++ and a very limited amount of C, so don't put too much value in my comment. You seem to be very fond of switch statements; consider not stuffing too much code into each case. It makes the flow hard to follow. Break the case code into functions and call those.

You have a switch with 40 cases just to load the SPIR-V. I feel like there's a better way to deal with that - maybe just a strict naming convention, so having the ID is enough to locate the file.
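Something along these lines, for example (a hypothetical sketch of that idea, not actual VkFFT code - the path pattern is made up):

    #include <stdio.h>

    /* Hypothetical: map a kernel ID straight to its SPIR-V file instead of
       enumerating 40 cases in a switch. The naming pattern is made up. */
    static FILE* open_shader(unsigned int kernel_id) {
        char path[64];
        snprintf(path, sizeof(path), "shaders/fft_%u.spv", kernel_id);
        return fopen(path, "rb");
    }

    int main(void) {
        FILE* f = open_shader(7);   /* would open "shaders/fft_7.spv" */
        if (f) fclose(f);
        return 0;
    }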

Impressive work in any case.


There are surely many different ways to get the job done. The vkFFT.h file by itself doesn't do any computations, btw - it is more like a configurator that launches shaders (something similar to kernels in CUDA). Having only one header also makes it easier to distribute the library as code.

I just like the switches because in GLSL (the shaders) the compiler can optimize and unroll them if the key is provided as a constant. This can give a non-negligible speed increase. So over the course of development I got to the point where the switch logic is not hard for me to follow.


Thank you for explaining. It makes sense, given the nature of the library, that one would pick the method or coding style that boosts performance.


Rendering a triangle in Vulkan will make you cry.


With OpenGL you draw a triangle, and eventually write a pipeline.

With Vulkan you write a pipeline, and eventually draw a triangle.


Best explanation of the two APIs I've ever heard.

OpenGL "hello triangle" is short only if you cut corners. If you do it the way you'd do in a production app, you're not that far off from the lines of code it takes to do it in Vulkan. It's still less, but on the same order of magnitude.


OpenGL gives you a Toyota, Vulkan gives you the parts to a Ferrari


That's because Vulkan is designed to render millions of triangles, not one.

You wouldn't use Vulkan to render a single triangle for the same reasons why you wouldn't use a helicopter to get a bottle of milk from your local shop.


What's your point? So will writing your own custom UI toolkit, or manually doing your own font rendering, or implementing a custom equivalent to TensorFlow, or writing your own implementation of the C standard library, or ...

If you don't need low level control, you should be using middleware or a full blown engine (Godot, Unity, Unreal, etc).


Eh, it's an investment.

Debugging your first segfault will also make you cry, but it's good for you. It builds character, and prepares you for the more insidious segfaults that are lurking out in the tall grasses.


This is why you use a game/rendering framework/engine if you want to render a single triangle to play around.

This kind of "hardcore requirements just to render a single triangle" is what is needed if you wish to have AAA game titles with realistic graphics at 60fps, as it gives developers the freedom to do anything they wish.

More freedom -> better graphics at higher performance but more difficult to draw a single triangle.

Less freedom -> easier for beginners to draw a triangle but that is irrelevant in practice because that is not what those APIs are for.


Wasn't that same joke made about OpenGL? Plus ça change :)


OpenGL 1 was very straightforward, and if you want, you can still use it.

Recent versions of OpenGL are more difficult. But compared to OpenGL, Vulkan is in an entirely different class of difficulty.


I wouldn't call it difficulty, or even complexity. Verbose is more apt.

It's daunting for a beginner but for a seasoned graphics programmer it is fairly straightforward. But still really verbose.


It's not straightforward even for seasoned graphics programmers. It's definitely easier for them because they have a good grasp of advanced concepts such as barriers, image layouts, and synchronization (things you don't really have to deal with in OpenGL/DX11-style APIs), but even they make hard-to-debug mistakes in AAA games/engines.


Plus ça reste pareil! Plus, it's not even that hard, and if you really find it that hard you can always use a wrapper library.



