Are you referring to internal bandwidth? Because CPUs have roughly the same external memory bandwidth as FPGAs, and GPUs have about an order of magnitude more.
Both internal and external. The larger FPGA parts can host quite a lot of DDR controllers, but the most important thing here is that there are HUNDREDS of single-cycle dual-port block RAMs. No CPU or GPU can match this with their pitiful, tiny, fixed caches.
Xilinx Kintex UltraScale KU040-2FFVA1156E (quoted price: about 2000 USD), a typical high-performance-computing FPGA, can fit at most 3 (maybe fewer) 64-bit DDR4 controllers @ 2400 Mbit/s per pin, for a theoretical peak bandwidth of 57.6 GB/s.
Intel Xeon E5-2670 v3: 68 GB/s [1].
NVidia K40: 288 GB/s.
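For anyone who wants to check the arithmetic behind the 57.6 GB/s figure, here is a rough sketch. It assumes 1 GB = 10^9 bytes and ignores refresh, bus turnaround, and every other real-world overhead; the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(controllers: int, bus_width_bits: int, rate_mt_s: float) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes), no overheads."""
    bytes_per_transfer = controllers * bus_width_bits / 8
    return bytes_per_transfer * rate_mt_s / 1000  # MB/s -> GB/s


# Figures quoted in this thread.
ku040 = peak_bandwidth_gb_s(controllers=3, bus_width_bits=64, rate_mt_s=2400)
print(f"KU040, 3 x 64-bit DDR4-2400: {ku040:.1f} GB/s")

for name, gb_s in [("Xeon E5-2670 v3", 68.0), ("K40", 288.0)]:
    print(f"{name}: {gb_s:.0f} GB/s, {gb_s / ku040:.1f}x the FPGA figure")
```

Each 64-bit controller moves 8 bytes per transfer, so one controller at 2400 MT/s tops out at 19.2 GB/s and three of them at 57.6 GB/s.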
So you're wrong about external memory bandwidth. Regarding internal bandwidth: True, FPGA block RAM has a huge aggregated bandwidth, but it comes with some limitations.
There are no hard-core SDRAM controllers on the KU040; I'm talking about soft-core controllers. The bandwidth is limited by the number of I/O pins (what good is a memory controller without access to a memory chip?), and SDRAM needs quite a number of signal lines.
What FPGA device are you using? How many memory controllers are on it, what width do they have, and at what frequency are the memory modules operating? What external bandwidth do you actually achieve?
> For the problems I am working with, FPGAs, even the mid-range ones, are far more suitable than the top GPUs.
I believe you, but we're talking about maximum external memory bandwidth, not suitability in general.
Tbh I do not care that much about external memory bandwidth. I only need internal bandwidth plus some slow swapping into external DDR (I have used at most 4 channels so far). My use cases are solely streaming, so transceivers are more than enough. In some cases even a lowly Spartan6 is able to beat the shit out of Teslas. Compare hundreds of memory fetches per cycle vs. whatever the pitiful NVidia cache is capable of (and remember that if your load is thrashing your cache, you're screwed: there is no way to fix it unless you can pre-scramble your data for linear access).
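To put rough numbers on "hundreds of memory fetches per cycle", here is a back-of-the-envelope sketch. The block count, port width, and clock below are made-up illustrative values, not taken from any datasheet, and the result is only reachable if the design really does hit every port on every cycle.

```python
def aggregate_bram_bandwidth_gb_s(blocks: int, ports_per_block: int,
                                  port_width_bits: int, clock_mhz: float) -> float:
    """Aggregate on-chip bandwidth in GB/s if every port is used every cycle."""
    bits_per_cycle = blocks * ports_per_block * port_width_bits
    return bits_per_cycle / 8 * clock_mhz / 1000  # bytes/cycle * MHz = MB/s -> GB/s


# Hypothetical device: 300 dual-port blocks, 36-bit ports, 200 MHz fabric clock.
total = aggregate_bram_bandwidth_gb_s(blocks=300, ports_per_block=2,
                                      port_width_bits=36, clock_mhz=200)
print(f"{300 * 2} independent accesses per cycle, about {total:.0f} GB/s aggregate")
```

The point is less the headline number than the access pattern: each port is addressed independently, so the figure does not depend on the accesses being linear or cache-friendly.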