Hacker News

Indeed. I have pretty much stopped engaging with the standard dialogue, repeated ad infinitum, that goes along the lines of "code the bottleneck in C" and "the GIL is a non-issue, just use parallel processes".

For some workloads, the latter is actually good advice, but for my typical use case it does not help. These are tight-ish loops wrapped around a fork-join. Shared-memory handling can be quite clunky in numpy, and if you want to do message passing instead, the overheads bleed off any advantage that parallelism ought to have given you. I don't mind the message-passing abstraction as such; it's just that the overhead of doing it in Python/numpy is too high. As for the former, a major motivation for using numpy et al. in the first place was to avoid C with its explicit indexing over arrays, which is both verbose and error-prone.
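To give a concrete flavor of the clunkiness, here is a minimal sketch using the stdlib multiprocessing.shared_memory module (the array contents are hypothetical, chosen just for illustration). The shape, dtype, and segment name all have to be shuttled around by hand, and cleanup is manual:

```python
import numpy as np
from multiprocessing import shared_memory

# A hypothetical array we want workers to see without copying.
a = np.arange(8, dtype=np.float64)

# Manual bookkeeping: the buffer size, segment name, dtype, and
# shape all travel separately from the array itself.
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
view = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
view[:] = a  # copy the data into the shared segment

# A worker process would reattach by name and rebuild the view:
shm2 = shared_memory.SharedMemory(name=shm.name)
view2 = np.ndarray(a.shape, dtype=a.dtype, buffer=shm2.buf)
view2 *= 2  # in-place update, visible through the other view

result = view.copy()  # snapshot before tearing the segment down

# Cleanup is also on you: close every handle, then unlink once.
shm2.close()
shm.close()
shm.unlink()
```

Compare this with a plain numpy expression: every piece of metadata numpy normally tracks for you becomes something you pass around explicitly, which is exactly the bookkeeping overhead at issue.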

It is never pleasant to drop into a different language, though it is much, much better than it could be, thanks to SWIG, Cython, and Weave. Contrary to common wisdom, I prefer Weave because of its much more succinct syntax. In Cython I am back to writing C again, just with a different syntax. This is not a criticism of Cython; it's an excellent tool, and it is far more pleasant to parallelize from Cython than from numpy/Python.

Julia looks pretty good. I have one suggestion: the best way to get speed out of Julia is not to write vectorized expressions but to write out explicit loops. That's a little unfortunate, because although vectorization constructs evolved out of the necessity of avoiding loops (which were slow in the older languages), they had an excellent byproduct: succinct code. Ideally I would like to retain that.
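To illustrate the trade-off in Python/numpy terms (the function names here are mine, not anything from the thread): the vectorized form is one succinct expression, while the devectorized form, the style that is fast in Julia, is the explicit loop over indices:

```python
import numpy as np

def saxpy_vectorized(a, x, y):
    # Succinct vectorized form: one line, but it allocates a
    # temporary for a * x before the addition.
    return a * x + y

def saxpy_loop(a, x, y):
    # Explicit-loop form: the style that is fast in Julia (or
    # Cython), but slow when interpreted in pure Python.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = a * x[i] + y[i]
    return out

x = np.arange(4.0)   # [0., 1., 2., 3.]
y = np.ones(4)
vec = saxpy_vectorized(2.0, x, y)
loop = saxpy_loop(2.0, x, y)
assert np.allclose(vec, loop)
```

Both compute the same result; the complaint above is precisely that in Julia the second, more verbose form is the fast one, giving up the succinctness that vectorization bought.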



