Some more great probabilistic Python libraries:

https://github.com/datamade/usaddress - "usaddress is a Python library for parsing unstructured address strings into address components, using advanced NLP methods."

https://github.com/datamade/probablepeople - "probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods."

https://github.com/chardet/chardet - Detects the most likely encoding of a raw byte string.
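For anyone who wants to try these, here is a rough sketch of how they might be called. It assumes `usaddress.parse()` returns a list of `(token, label)` pairs and `chardet.detect()` returns a dict with `encoding` and `confidence` keys, which is how I understand their documented interfaces. The helper names (`group_labeled_tokens`, `parse_address`, `guess_and_decode`) and the 0.5 confidence cutoff are my own and not part of either library. The third-party imports are deferred so the helpers can be read and tested without installing anything.

```python
from itertools import groupby


def group_labeled_tokens(pairs):
    """Merge consecutive (token, label) pairs that share a label.

    Returns a list of (label, text) tuples in the original order.
    """
    grouped = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in run)
        grouped.append((label, text))
    return grouped


def parse_address(text):
    """Label an address string with usaddress and group the result."""
    import usaddress  # third-party: pip install usaddress

    return group_labeled_tokens(usaddress.parse(text))


def guess_and_decode(raw, detect=None, min_confidence=0.5):
    """Decode bytes with the detector's best guess, or return None.

    `detect` defaults to chardet.detect; min_confidence is an arbitrary
    cutoff chosen for this sketch, not a chardet recommendation.
    """
    if detect is None:
        import chardet  # third-party: pip install chardet

        detect = chardet.detect
    guess = detect(raw)
    encoding = guess.get("encoding")
    if encoding is None or guess.get("confidence", 0.0) < min_confidence:
        return None
    return raw.decode(encoding, errors="replace")
```

The same grouping helper should work on `probablepeople.parse()` output too, assuming it returns pairs in the same shape. Since the result is only the most likely guess, treating a low-confidence encoding as "unknown" seems safer than decoding anyway.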


A caveat about the usaddress library, since I was surprised when it failed spectacularly as I played with it: the 'us' in the name appears to refer to the US, not to 'unstructured'. The readme doesn't mention this, though there is a small US flag emoji in the GitHub about string.


Nice! In the same spirit, here’s an interesting talk on using Gen.jl (a probabilistic programming library/framework) for cleaning messy data in tables: https://youtu.be/vUxrtqY84AM


I have used and benefited tremendously from both of these libraries. While the methods are sound, the training data they used is not that comprehensive. You will probably want to apply some heuristic cleanup before and after processing. Or, if your organization has a lot of time and money, add additional training data.
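To make the cleanup idea concrete, here is a minimal sketch of wrapping a parser with heuristics on both sides. It only assumes the parser takes a string and returns `(token, label)` pairs, as `usaddress.parse` does. The specific rules (NFKC normalization, whitespace collapsing, a 5-digit or ZIP+4 check) and the `"Unverified"` label are made up for illustration; the right heuristics depend on your data.

```python
import re
import unicodedata

# Assumed shape of a US ZIP code: 5 digits, optionally followed by -4 digits.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def pre_clean(text):
    """Normalize Unicode, drop non-printable characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch if ch.isprintable() else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def post_clean(pairs):
    """Strip stray punctuation and relabel ZIP codes that look implausible."""
    cleaned = []
    for token, label in pairs:
        token = token.strip(" ,;")
        if label == "ZipCode" and not ZIP_RE.match(token):
            label = "Unverified"  # hypothetical label, not one the library emits
        if token:
            cleaned.append((token, label))
    return cleaned


def parse_with_cleanup(text, parse):
    """Run pre_clean, then the supplied parser, then post_clean."""
    return post_clean(parse(pre_clean(text)))
```

With usaddress installed, this would be called as `parse_with_cleanup(raw_string, usaddress.parse)`. Passing the parser in as an argument keeps the heuristics testable on their own and lets the same wrapper sit around a different parser.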



