Speed in benchmarks [0] looks impressive, but they seem to have been run on an old version of Arraymancer (0.2.90). Were there any speed improvements/regressions in the newer version(s)?
Is there any "production code" written in Arraymancer?
The code didn't change, so there shouldn't be any difference on Nim stable. Unfortunately, I caught a regression in the Nim #devel branch itself which has a big impact if tensors are created in a tight loop.
All in all, I take performance very seriously and regularly check the generated assembly and the memory overhead of the library. It should be at least as fast as any C or C++ library: even where those are backed by an optimizing compiler (TensorFlow, Tensor Comprehensions/Halide, or MXNet/TVM), my intermediate language is C and my optimizing compiler is GCC/LLVM. Critical parts should even reach Fortran speed thanks to heavy use of the __restrict__ and assume_aligned compiler builtins.
I also take great care with how each algorithm is implemented, for both numerical stability and speed: for example, a numerically stable 1-pass parallel softmax_cross_entropy through a Frobenius inner product, which I didn't see in any library I checked (Caffe, TensorFlow, Torch).
Also, while deep learning is a focus, I've added general linear algebra and ML features like a least-squares solver and PCA.
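For a feel of what a least-squares solver does, here is a minimal C sketch for the simplest case, fitting a line y = a + b*x via the closed-form normal equations (my own illustration; a general solver, such as the LAPACK gelsd driver that numeric libraries typically wrap, also handles multiple columns and rank-deficient systems):

```c
/* Ordinary least squares for y = a + b*x over n points, using the
 * closed-form normal-equation solution for slope and intercept. */
static void lstsq_line(const double *x, const double *y, int n,
                       double *a, double *b) {
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (int i = 0; i < n; ++i) {
    sx += x[i];
    sy += y[i];
    sxx += x[i] * x[i];
    sxy += x[i] * y[i];
  }
  double denom = n * sxx - sx * sx; /* assumes x values are not all equal */
  *b = (n * sxy - sx * sy) / denom; /* slope */
  *a = (sy - *b * sx) / n;          /* intercept */
}
```

For exact data such as x = {0, 1, 2, 3}, y = {1, 3, 5, 7}, this recovers intercept 1 and slope 2.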