Having ported a decent chunk of code to Numba, I've found the key is to minimize where you interface with it.
You need to be careful not to inadvertently introduce new types. In particular, Numba doesn't (last I checked) recognize homogeneous tuples (Tuple[Foo, ...] in typing), so each new tuple length requires recompilation.
Similarly, every call does type inference on its arguments, including jitclass constructors. If you're making many calls with a large number of arguments, you may be killing your performance gains even absent compilation.
If you're trying to write code that can run with or without Numba (e.g. the same logic may or may not run in a jitted loop), definitely avoid jitclasses.
All in all, just_temp's remark that you have to "write it like Fortran" is pretty close. The reason it worked for me was that I had a lot of business-logic-style code segregated into an early section that spat out very regular, primitive structures. That meant the code that had to be fast was already very Fortran-like.
I had considered Numba in the past, but it just seemed not worth the overhead. A few talks from this year show that they have really expanded the library, to the point where much of the scientific Python stack uses it instead of Cython. It can target things like ARM devices and is more flexible in the types it can take (dicts!).
For reference
https://www.youtube.com/watch?v=cR8E70GTO8c
and
https://www.youtube.com/watch?v=6oXedk2tGfk
I think it's rather premature to say that the scientific Python stack is adopting Numba. None of the core projects like SciPy, pandas, and scikit-learn have been willing to swap out Cython for Numba. Cython is still dominant and I don't see that changing anytime soon.
Cognitive. Things like having to strip down abstractions and "write it like Fortran". The fact that it can deal with numpy arrays no problem and can actually handle more common Python objects like dicts means there's less overhead.
There's also CuPy, which is NumPy with CUDA acceleration: a drop-in replacement for most of NumPy. You can also easily use CUDA kernels inside Python, and even run Numba functions generated with @numba.cuda.jit.
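The "drop-in" part is the key selling point: because CuPy mirrors the NumPy API, the same code can run on GPU or CPU depending on what's installed. A hedged sketch (falling back to NumPy when CuPy isn't available):

```python
try:
    import cupy as xp   # GPU arrays, if CUDA is available
except ImportError:
    import numpy as xp  # CPU fallback with the same API

a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
result = (a * 2).sum(axis=1)  # identical call on GPU or CPU -> [ 6. 24.]
```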
Really wish they would implement texture memory for CUDA. I used Numba initially but switched to PyCUDA for that feature alone; I gained a 2-3x runtime speedup for a raytracing-based simulation.