Thanks for the tip. I'll see where I can apply it.
Most of the time is spent in the convolution layers. Convolution is not a matrix multiplication in the current implementation. I guess it could be turned into one in the frequency domain or by using a Toeplitz matrix.
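For what it's worth, here is a minimal numpy sketch of the Toeplitz/im2col idea: unroll every receptive-field patch into a column so the whole convolution (strictly speaking, cross-correlation, as usual in NN code) collapses into one dense matrix multiplication. The function name and shapes are purely illustrative, nothing here is taken from ConvNetSharp:

```python
import numpy as np

def conv2d_as_matmul(x, kernels, stride=1):
    """'Valid' 2D convolution of a single-channel input, written as one dense GEMM."""
    H, W = x.shape
    num_filters, kH, kW = kernels.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1

    # im2col: each column holds one flattened receptive-field patch.
    cols = np.empty((kH * kW, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kH, j * stride:j * stride + kW]
            cols[:, i * out_w + j] = patch.ravel()

    # One matrix multiplication applies every filter at every position at once.
    weights = kernels.reshape(num_filters, kH * kW)
    return (weights @ cols).reshape(num_filters, out_h, out_w)

# e.g. an 8x8 input with three 3x3 filters -> (3, 6, 6) feature maps
y = conv2d_as_matmul(np.random.randn(8, 8), np.random.randn(3, 3, 3))
```

The patch-unrolling costs extra memory, but it means the inner loop is a plain dense GEMM, which is where optimised BLAS-style routines shine.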
I've implemented a parallel CPU version and had a go at a GPU implementation, but I'm not satisfied at all with the GPU version :)
> Convolution is not a matrix multiplication in the current implementation
I figure there's a code re-organisation task since propagating node activations through a layer of weights is essentially a matrix multiplication (fully connected => fully dense matrix).
The optimised routines make use of vectorised CPU instructions and FMA (fused multiply-add), all of which are perfect fits for [dense] matrix multiplication. They're not so great for sparse matrices, but they still help; unless the matrix is very sparse, it's usually faster to use a dense format with zeros for the missing weights.
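To make that concrete, a forward pass through a fully connected layer over a batch is exactly one dense matrix multiplication, the kind of work vectorised/FMA BLAS routines are built for. A small numpy sketch (the function name and shapes are just for illustration, not from ConvNetSharp):

```python
import numpy as np

def dense_forward(activations, weights, bias):
    """Fully connected layer: every output is a dot product over all inputs,
    so a whole batch reduces to a single dense matrix multiplication."""
    # activations: (batch, n_in), weights: (n_in, n_out), bias: (n_out,)
    return activations @ weights + bias

# A sparsely connected layer can reuse the same dense path by storing zeros
# for the missing weights; unless it's very sparse, that usually beats a
# true sparse-matrix format on modern CPUs.
batch = np.random.randn(32, 256)
w = np.random.randn(256, 128)
b = np.zeros(128)
out = dense_forward(batch, w, b)   # (32, 128)
```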
https://github.com/cbovar/ConvNetSharp/blob/master/src/ConvN...
https://github.com/cbovar/ConvNetSharp/blob/Gpu/src/ConvNetS...
Pull requests more than welcome!