In hlb-CIFAR10, the MaxPooling ops are now the slowest kernels and, if I understand correctly, take longer than all 7 of the convolution operations combined.
Memory-bound operations seem to be the limiting factor pretty consistently, at least in my personal ML research work. It can be rather annoying!
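For anyone wondering why a pooling op can lose to a convolution, here is a rough back-of-the-envelope sketch of arithmetic intensity (operations per byte moved). The function names, the layer shapes (64 channels at 32x32), and the fp16 element size are all illustrative assumptions on my part, not the actual hlb-CIFAR10 configuration, and the model assumes every tensor is read or written exactly once.

```python
def maxpool2d_intensity(c, h, w, k=2, bytes_per_elem=2):
    """Comparisons per byte for a k x k, stride-k max pool over a (c, h, w) input.

    Hypothetical helper: idealized traffic, each tensor touched once.
    """
    out_elems = c * (h // k) * (w // k)
    ops = out_elems * (k * k - 1)  # k*k - 1 comparisons per output element
    bytes_moved = (c * h * w + out_elems) * bytes_per_elem  # read input, write output
    return ops / bytes_moved


def conv2d_intensity(c_in, c_out, h, w, k=3, bytes_per_elem=2):
    """FLOPs per byte for a k x k, stride-1, same-padded convolution.

    Hypothetical helper: idealized traffic, each tensor touched once.
    """
    flops = 2 * c_in * c_out * k * k * h * w  # one multiply + one add per MAC
    elems_moved = c_in * h * w + c_out * h * w + c_in * c_out * k * k
    return flops / (elems_moved * bytes_per_elem)


if __name__ == "__main__":
    # Assumed shapes for illustration only (fp16, 64 channels, 32x32 feature map).
    print(f"maxpool 2x2: {maxpool2d_intensity(64, 32, 32):.2f} ops/byte")
    print(f"conv 3x3:    {conv2d_intensity(64, 64, 32, 32):.2f} FLOPs/byte")
```

Under these assumptions the pool does well under one operation per byte while the convolution does hundreds, so the pool's runtime is set almost entirely by memory bandwidth rather than compute.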