But if your network contains a layer whose linear part performs an (approximate) DFT, you will get an efficiency gain by replacing it with an exact FFT: a dense n×n matrix-vector product costs O(n^2), while the FFT costs O(n log n).
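
To make the DFT point concrete, here is a minimal NumPy sketch. It assumes the layer's linear part is exactly the dense DFT matrix; `dft_matrix` and the size `n = 256` are illustrative names and values, not taken from any real network.

```python
import numpy as np

def dft_matrix(n):
    # Dense n x n DFT matrix: W[j, k] = exp(-2*pi*i*j*k / n).
    # (Hypothetical stand-in for a layer's weight matrix.)
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

n = 256
rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# What a dense layer would compute: an O(n^2) matrix-vector product.
dense_out = dft_matrix(n) @ x

# The same linear map computed by the FFT in O(n log n).
fft_out = np.fft.fft(x)

# Largest elementwise disagreement between the two results.
max_diff = np.max(np.abs(dense_out - fft_out))
```

The two outputs agree up to floating-point rounding, so the swap changes the cost but not the function the layer computes. A learned layer would only approximate the DFT matrix, so in that case the swap also changes the output slightly.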

You wouldn't want to use an FFT for most CNNs anyway, because the kernels have very small support. For a kernel of fixed size k, convolving with it costs O(nk) in the spatial domain, which is O(n), as long as you exploit the sparsity.
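
As a rough 1-D illustration of the small-kernel case (a sketch, not how any particular framework implements it), the direct version below does one pass over the signal per kernel tap, while the FFT version pays O(n log n) regardless of kernel size. The function names `sparse_conv` and `fft_conv` are made up for this example.

```python
import numpy as np

def sparse_conv(x, kernel):
    # Full linear convolution using only the k nonzero taps:
    # k shifted, scaled adds over the signal, so O(n * k) work.
    n, k = len(x), len(kernel)
    out = np.zeros(n + k - 1)
    for j, w in enumerate(kernel):
        out[j:j + n] += w * x
    return out

def fft_conv(x, kernel):
    # The same convolution via the FFT: O(m log m) with m = n + k - 1,
    # independent of how small the kernel is.
    m = len(x) + len(kernel) - 1
    return np.fft.irfft(np.fft.rfft(x, m) * np.fft.rfft(kernel, m), m)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
kernel = np.array([0.25, 0.5, 0.25])  # 3-tap kernel, typical CNN size

direct = sparse_conv(x, kernel)
via_fft = fft_conv(x, kernel)
reference = np.convolve(x, kernel)  # NumPy's own full convolution
```

Both routes give the same result up to rounding; with three taps the direct route needs only about 3n multiply-adds, which is why the FFT does not pay off here.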