NumPy certainly does not have anything like "Fortran kernels".
Elementwise arithmetic is coded in C (and not very fast...). Linear algebra uses the BLAS and LAPACK interfaces, of these only LAPACK is usually written in Fortran these days... but anyway, a Fortran matrix is a C transposed matrix and vice versa and these interfaces all support transpose as an argument, so there is no need for reordering. And if you do np.dot(A.T, A) there is no reordering either, just a change of flags to DGEMM.
There may not be an absolute need in terms of ability to perform operations, but performance is heavily dependant on accessing in the correct order - otherwise your accessible memory bandwidth goes down the drain, as do cache hitrates.
I'm guessing LAPACK might not be too dependant on order though, but I haven't studied its performance tbh.
Algorithms to handle what you talk about efficiently is one of the primary reason these libraries even exist. OF COURSE they access memory in the right order; that is their job.
But it is not really about right vs wrong order, but using an algorithm to tile the data efficiently. Google "Anatomy of High Performance Matrix Multiplication" for examples from OpenBLAS.
This is where NumPy falls through -- "a.T + a" will be very slow no matter what you do, but it can be done efficiently with tiling.
Interesting. How does it decide the storage order when initializing an array? I'd assume due to python's dynamic nature it can't optimize for later operations there, right?