https://github.com/datamade/usaddress - "usaddress is a Python library for parsing unstructured address strings into address components, using advanced NLP methods."
https://github.com/datamade/probablepeople - "probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods."
A note on the usaddress library, since I was surprised when it failed spectacularly for me: the 'us' in the name appears to refer to the US, not to 'unstructured'. The readme doesn't mention this, though there is a small US flag emoji in the GitHub about string.
Nice! In the same spirit, here’s an interesting talk on using Gen.jl (a probabilistic programming library/framework) for cleaning messy data in tables: https://youtu.be/vUxrtqY84AM
I have used and benefited tremendously from both of these libraries. While the methods are sound, the training data they used is not that comprehensive, so you will probably want to apply some heuristic cleanup before and after processing. Or, if your organization has a lot of time and money, add additional training data.
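A minimal sketch of that pre/post cleanup around usaddress (the normalization rules here are illustrative assumptions, not anything the library prescribes):

    import re
    import usaddress

    def clean(raw: str) -> str:
        # Illustrative pre-processing: collapse whitespace,
        # drop periods ("St." -> "St"), space out commas.
        s = " ".join(raw.split())
        s = s.replace(".", "")
        s = re.sub(r",(?=\S)", ", ", s)
        return s

    raw = "  123  N.   Main   St.,Springfield IL 62704 "
    try:
        tagged, address_type = usaddress.tag(clean(raw))
        print(address_type, dict(tagged))
    except usaddress.RepeatedLabelError:
        # Post-processing hook: fall back to token-level labels and
        # resolve repeated labels with your own heuristics.
        print(usaddress.parse(clean(raw)))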
The web devs tell me that fuckit's versioning scheme is confusing, and that I should use "Semitic Versioning" instead. So starting with fuckit version ה.ג.א, package versions will use Hebrew Numerals.
I find it kind of funny that they would choose to show those as demos when it's obvious that most of them really aren't YouTube video IDs. "Accept-Lang", for instance, is pretty obviously not a video ID, even though it matches the [A-Za-z0-9_-]{11} pattern and technically could be a valid one.

On the other hand, I don't know how you would actually verify whether an 11-character string is or isn't a YouTube ID (short of querying YouTube itself), so I suppose it's nice that potential IDs are shown; it just seems they have a very high chance of being false positives.
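For illustration, the quoted pattern happily accepts header-like words; only the first string below is a real video ID:

    import re

    # The 11-character pattern mentioned above.
    YOUTUBE_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")

    for candidate in ["dQw4w9WgXcQ", "Accept-Lang", "Content-Len"]:
        print(candidate, bool(YOUTUBE_ID.match(candidate)))
    # All three match the pattern, but only the first is an actual video.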
You can reduce false positives by trying to identify base64-looking strings that are exactly 11 characters long: require a certain amount of entropy, a plausible uppercase/lowercase/digit distribution, and so on. You might risk false negatives, but different flags for different sensitivity levels could help with that.
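A rough sketch of such a heuristic; the thresholds are made-up assumptions you'd have to tune:

    import math
    import string
    from collections import Counter

    def shannon_entropy(s: str) -> float:
        # Shannon entropy in bits per character.
        counts = Counter(s)
        n = len(s)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def looks_like_youtube_id(s: str,
                              min_entropy: float = 3.0,
                              min_classes: int = 3) -> bool:
        # 11 chars from the ID alphabet, reasonably random-looking.
        # Loosening/tightening the thresholds trades false positives
        # against false negatives (e.g. expose them as sensitivity flags).
        if len(s) != 11:
            return False
        alphabet = string.ascii_letters + string.digits + "-_"
        if any(ch not in alphabet for ch in s):
            return False
        classes = sum([
            any(ch.islower() for ch in s),
            any(ch.isupper() for ch in s),
            any(ch.isdigit() for ch in s),
        ])
        return shannon_entropy(s) >= min_entropy and classes >= min_classes

    print(looks_like_youtube_id("dQw4w9WgXcQ"))  # True: mixed case + digits
    print(looks_like_youtube_id("Accept-Lang"))  # False: no digits, word-like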
At least from the config I see that the rarity for it is set pretty low (0.2), so you can filter out the low-rarity stuff. I would probably run it by default with something like --rarity 0.5.
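Something like this post-filter is what I mean; the match records here are hypothetical, not necessarily the tool's actual output schema:

    # Hypothetical match records with a rarity score, for illustration only.
    matches = [
        {"name": "YouTube Video ID", "matched": "Accept-Lang", "rarity": 0.2},
        {"name": "Email Address", "matched": "me@example.org", "rarity": 0.9},
    ]

    MIN_RARITY = 0.5  # the same idea as running with --rarity 0.5

    for m in matches:
        if m["rarity"] >= MIN_RARITY:
            print(f"{m['name']}: {m['matched']}")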
Why are these screenshots animated? The command is still visible in the final frame, and the final frame shows the output we're interested in, but it isn't displayed long enough to read and understand before the animation loops.
At first I thought this was going to be like Google Lens. Instead, it's a way to probabilistically identify things in strings. I have wished for this to exist, and I've made my own dumbed-down version of it before. This could be very useful for less fragile screen scraping.
Bee is a really tremendous and generous developer. I use a few of their other projects near-daily (RustScan especially has changed my life). Definitely one of those open source devs you follow just to see whatever they come up with next.