I have, yes. I can't speak for OpenBLAS or MKL, but I'm familiar with Eigen's and nalgebra's implementations to some extent.
nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. This is great for small matrices, but scales poorly for larger ones.
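To make "one column at a time" concrete, here is a rough sketch of an unblocked Cholesky factorization; it is illustrative only, not nalgebra's actual code:

    // Sketch of a column-at-a-time (unblocked) Cholesky; not nalgebra's
    // actual implementation. `a` is a symmetric positive-definite n x n
    // matrix in row-major order; on return its lower triangle holds L,
    // with A = L * L^T.
    fn cholesky_unblocked(a: &mut [f64], n: usize) {
        for j in 0..n {
            // Diagonal entry: L[j][j] = sqrt(A[j][j] - sum_k L[j][k]^2).
            let mut d = a[j * n + j];
            for k in 0..j {
                d -= a[j * n + k] * a[j * n + k];
            }
            let d = d.sqrt();
            a[j * n + j] = d;
            // Entries below the diagonal in column j, one scalar at a time.
            // Every column pass re-reads all earlier columns, so for large n
            // this is dominated by memory traffic; a blocked variant groups
            // columns into panels and turns the trailing update into a
            // matrix-matrix product instead.
            for i in (j + 1)..n {
                let mut s = a[i * n + j];
                for k in 0..j {
                    s -= a[i * n + k] * a[j * n + k];
                }
                a[i * n + j] = s / d;
            }
        }
    }

The flop count is the same with or without blocking; the difference is how much data gets re-read from memory per flop.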
Eigen uses blocking for most decompositions, other than the eigendecomposition, but it doesn't have a proper threading framework. The only operation that is properly multithreaded is matrix multiplication, using OpenMP (plus the unstable Tensor module, which uses a custom thread pool).
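For the threading side, the basic pattern behind an OpenMP-parallel matmul like Eigen's is to split C into independent row blocks. Here is a toy version of that pattern, written with Rust scoped threads standing in for OpenMP, purely as an illustration:

    use std::thread;

    // Toy illustration of row-block parallelism for C += A * B. All
    // matrices are n x n, row-major; zero C first for a plain product.
    fn parallel_matmul(c: &mut [f64], a: &[f64], b: &[f64], n: usize, threads: usize) {
        assert!(n > 0 && threads > 0);
        let rows_per = (n + threads - 1) / threads;
        thread::scope(|s| {
            // Each thread owns a disjoint horizontal slab of C, so no
            // synchronization is needed during the multiply.
            for (t, c_block) in c.chunks_mut(rows_per * n).enumerate() {
                s.spawn(move || {
                    let row0 = t * rows_per;
                    let rows = c_block.len() / n;
                    for i in 0..rows {
                        for k in 0..n {
                            let aik = a[(row0 + i) * n + k];
                            for j in 0..n {
                                c_block[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                });
            }
        });
    }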
I skimmed your post and I wonder whether Mojo is focusing on such small 512x512 matrices? What is your thinking on generalizing your results to larger matrices?
I think for a compiler it makes sense to focus on small matrix multiplies, which are the building block of larger matrix multiplies anyway. Small matrix multiplies emphasize compiler/code-generation quality. Even vanilla Python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.
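To make that concrete, here is a rough sketch of a big multiply built as glue around a small fixed-size tile kernel; the tile size and function names are made up for illustration:

    // Sketch of a tiled matrix multiply: a large C += A * B is just a loop
    // over small T x T multiplies on tiles that fit in cache.
    const T: usize = 64;

    // One small multiply: the T x T tile of C at (i0, j0) accumulates the
    // product of A's tile at (i0, k0) and B's tile at (k0, j0). Matrices
    // are n x n, row-major.
    fn small_matmul(c: &mut [f64], a: &[f64], b: &[f64], n: usize,
                    i0: usize, j0: usize, k0: usize) {
        for i in 0..T {
            for k in 0..T {
                let aik = a[(i0 + i) * n + (k0 + k)];
                for j in 0..T {
                    c[(i0 + i) * n + (j0 + j)] += aik * b[(k0 + k) * n + (j0 + j)];
                }
            }
        }
    }

    // The big multiply is cheap glue: (n/T)^3 kernel calls of bookkeeping
    // around O(n^3) flops inside the kernel. C must be zero-initialized
    // for a plain product.
    fn big_matmul(c: &mut [f64], a: &[f64], b: &[f64], n: usize) {
        assert!(n % T == 0, "sketch assumes n is a multiple of the tile size");
        for i0 in (0..n).step_by(T) {
            for j0 in (0..n).step_by(T) {
                for k0 in (0..n).step_by(T) {
                    small_matmul(c, a, b, n, i0, j0, k0);
                }
            }
        }
    }

With n = 512 and T = 64 that is only 512 kernel calls, which is roughly why per-call overhead in the host language matters far less than the quality of the small kernel itself.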
There are also "official" ones: https://github.com/CGAL/cgal-swig-bindings. They don't cover all components; coverage is extended on a "when there are enough requests for it" basis.
Thanks, this brought back some memories. I did an internship / co-op in software engineering at that site in Rochester while a student. They needed extra help, so I worked a few evenings at 1.5x pay alongside the regular factory workers.