Some more great probabilistic Python libraries: https://github.com/datamade/usad...

ok123456 · on June 16, 2021

https://github.com/chardet/chardet - Detects the most likely encoding of a raw byte string.
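For anyone who hasn't used it, here is a rough sketch of how a guess from chardet might slot into a decoding step. `decode_best_effort` is a hypothetical helper name, not something chardet provides, and the choice of strict UTF-8 first with latin-1 as the last resort is an assumption for illustration. The only library call used is `chardet.detect`, which returns a dict with an `encoding` and a `confidence` entry.

```python
# Sketch only: decode_best_effort is a hypothetical helper, not part of chardet.
try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keep the sketch runnable without the library
    chardet = None


def decode_best_effort(raw: bytes) -> str:
    """Decode bytes, trying strict UTF-8 first and a statistical guess second."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    guess = None
    if chardet is not None:
        # chardet.detect returns a dict with "encoding" and "confidence" keys;
        # "encoding" can be None when the detector has no guess.
        guess = chardet.detect(raw)["encoding"]

    try:
        # Assumed fallback: latin-1 maps every byte, so it never fails.
        return raw.decode(guess or "latin-1", errors="replace")
    except LookupError:
        # Python has no codec registered under the guessed name.
        return raw.decode("latin-1", errors="replace")


text = decode_best_effort("naïve café".encode("utf-8"))
```

The guess is only the most likely encoding, so short inputs can easily come back wrong; checking the reported confidence before trusting the result is probably worth it.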

cge · on June 16, 2021

A note on the usaddress library, since I was surprised when it failed spectacularly as I played with it: the 'us' in the name appears to refer to the US, not 'unstructured'. The readme doesn't mention this, though there is a small US flag emoji in the GitHub "About" string.

ssivark · on June 16, 2021

Nice! In the same spirit, here’s an interesting talk on using Gen.jl (a probabilistic programming library/framework) for cleaning messy data in tables: https://youtu.be/vUxrtqY84AM

nerdponx · on June 16, 2021

I have used and benefited tremendously from both of these libraries. While the methods are sound, the training data they used is not that comprehensive. You will probably want to apply some heuristic cleanup before and after processing. Or, if your organization has a lot of time and money, add additional training data.
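To make the before/after cleanup idea concrete, a minimal sketch follows. All of the names here (`clean_before`, `clean_after`, `parse_with_cleanup`) are hypothetical, and the parser is passed in as a plain callable returning a label-to-value dict, which is an assumption rather than the interface of either library. The specific rules (collapse whitespace, trim stray commas and semicolons, drop empty fields) are just examples of the kind of heuristics meant.

```python
import re
from typing import Callable, Dict

# Hypothetical interface: any function mapping a string to labeled fields.
Parser = Callable[[str], Dict[str, str]]


def clean_before(raw: str) -> str:
    """Collapse runs of whitespace and trim stray separators before parsing."""
    text = re.sub(r"\s+", " ", raw).strip()
    return text.strip(",;")


def clean_after(fields: Dict[str, str]) -> Dict[str, str]:
    """Trim leftover punctuation from each value and drop empty fields."""
    cleaned = {}
    for label, value in fields.items():
        value = value.strip(" ,;")
        if value:
            cleaned[label] = value
    return cleaned


def parse_with_cleanup(raw: str, parser: Parser) -> Dict[str, str]:
    """Run the parser between the two cleanup passes."""
    return clean_after(parser(clean_before(raw)))
```

In practice `parser` would be a thin wrapper around whichever tagging function the library exposes, and the rules would grow out of the failure cases you actually see in your data.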