It's slightly exaggerated; the Python program might not have been able to fully utilize all cores, and it's really just 16 cores with hyperthreading. But it's not unreasonable: a 150x speed-up isn't unexpected when going from Python to C/Rust/C++ in number-crunching code, since 150/16 ≈ 9.4, i.e. roughly a 9x single-threaded speed-up times 16x from the cores (the 16 assumes that the gains from hyperthreading and the losses from imperfect parallelism more or less cancel out).
I don't think I have the code for these large-ish data-processing experiments any more, but it would be fun to make some toy problems with large amounts of data, write comparable Python and C implementations, and put the results in a blog post.
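Something like this pure-Python kernel would do as a starting point (a made-up toy problem along those lines, not the original code):

    import random, time

    def pairwise_sq_dist_sum(points):
        # O(n^2) inner loop, all of it interpreted Python
        total = 0.0
        for i in range(len(points)):
            xi, yi = points[i]
            for j in range(i + 1, len(points)):
                dx, dy = xi - points[j][0], yi - points[j][1]
                total += dx * dx + dy * dy
        return total

    points = [(random.random(), random.random()) for _ in range(2000)]
    start = time.perf_counter()
    print(pairwise_sq_dist_sum(points), "in", round(time.perf_counter() - start, 2), "s")

A line-for-line C translation of the same loops would then give a feel for the single-core gap, before any parallelism enters the picture.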
Can we assume that you weren't able to use NumPy here, or at least that your inner loops weren't using it? NumPy can be faster than hand-written C++ when you don't happen to know all the optimizations the NumPy library writers knew.
Yeah, I'm just talking about normal Python code here. If you're able to express your problem so that numpy, scipy, pytorch, NLTK, or some other C/Fortran-backed library does all the number crunching, Python's performance is less of an issue.
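As a rough sketch of what I mean (a made-up example, not from the original experiments), here's the same reduction written as a plain Python loop and as a NumPy call:

    import time
    import numpy as np

    def sum_of_squares_loop(xs):
        # every iteration and multiply goes through the interpreter
        total = 0.0
        for x in xs:
            total += x * x
        return total

    def sum_of_squares_numpy(xs):
        # same reduction, but the loop runs inside NumPy's compiled code
        return float(np.dot(xs, xs))

    data = np.random.rand(1_000_000)
    for fn in (sum_of_squares_loop, sum_of_squares_numpy):
        start = time.perf_counter()
        result = fn(data)
        print(fn.__name__, round(result, 3), round(time.perf_counter() - start, 4), "s")

The second version spends essentially no time in the interpreter, which is why it matters so much whether your inner loops can be expressed that way.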
Are you exaggerating? If not, can you share a bit more?