
There's definitely a conversion cost. For strings, Python apparently caches the UTF-8 encoded string, so if you _repeatedly_ transfer it to Rust I suspect (but haven't checked) that the cost is much lower.

In general I suspect it's the usual "NumPy arrays are fast; for everything else, you'd better be getting a sufficiently large boost from the low-level code to justify the conversion".

For the thing I prototyped in Rust, I was wrapping the `aho-corasick` crate, which was in fact faster than `pyahocorasick`, which is written in C or Cython or something. Both probably have similar conversion costs, so it came down to "for lots of data the Rust version was faster".




Be sure to use auto configuration to get it to go even faster, depending on your use case: https://docs.rs/aho-corasick/0.7.15/aho_corasick/struct.AhoC...

Or just be sure to enable the DFA option if you can afford it. It looks like the Python library is just the standard NFA algorithm.


Yeah, I was using DFA.

Next step is trying an alternative approach, but if that alternative doesn't work I'm going to see about wrapping your package for Python.

Thanks for all your work on it!


Nice! Reach out if there are any problems or if you need something exposed in the API. Looking at the pyahocorasick issue tracker, there are a number of features/bugs that your wrapper package would resolve. :)


NumPy also supports conversions without copying. One thing I haven't found a good way to bridge between Python and Rust is the pandas.DataFrame. It seems to be quite a Python-focused object, and iterating through a DataFrame is particularly slow.


Internally Pandas often uses NumPy arrays, especially for numeric data, so you might be able to pass things that way in some cases?

E.g. `df["column_name"].values` will get you a NumPy array.
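
A minimal sketch of that idea, assuming a small DataFrame with a single numeric column (the column name `price` and the data are made up for illustration). It pulls the column out as a NumPy array and works on the whole array instead of iterating over rows:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; the column name "price" is invented for this example.
df = pd.DataFrame({"price": [1.5, 2.0, 3.25]})

# .values returns the column as a NumPy array
# (.to_numpy() is the newer spelling of the same thing).
arr = df["price"].values

# Operate on the array as a whole rather than looping over DataFrame rows.
total = float(arr.sum())
```

Whether the array shares memory with the DataFrame or is a copy depends on the dtype and the pandas version, so it's worth checking before relying on it being zero-copy.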



