PyWhat: Identify Anything (github.com/bee-san)
290 points by trueduke on June 16, 2021 | 38 comments



Some more great probabilistic python libraries:

https://github.com/datamade/usaddress - "usaddress is a Python library for parsing unstructured address strings into address components, using advanced NLP methods."

https://github.com/datamade/probablepeople - "probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods."


https://github.com/chardet/chardet - Detects the most likely encoding of a raw byte string.


A note on the usaddress library, since I was surprised when it failed spectacularly as I played with it: the 'us' in the name appears to refer to the US, not 'unstructured'. There's no note of this in the readme, though there is a small US flag emoji in the GitHub about string.


Nice! In the same spirit, here’s an interesting talk on using Gen.jl (a probabilistic programming library/framework) for cleaning messy data in tables: https://youtu.be/vUxrtqY84AM


I have used and benefited tremendously from both of these libraries. While the methods are sound, the training data they used is not that comprehensive. You will probably want to apply some heuristic cleanup before and after processing. Or, if your organization has a lot of time and money, add additional training data.


In the same vague theme of "I don't know what I'm dealing with" : https://github.com/ajalt/fuckitpy


I like the Versioning section:

The web devs tell me that fuckit's versioning scheme is confusing, and that I should use "Semitic Versioning" instead. So starting with fuckit version ה.ג.א, package versions will use Hebrew Numerals.


I was disappointed that it's actually 4.8.1 in the setup.py


I can't decide what I'm more impressed with:

The 110% code coverage, the downloads per month, or the license.


I'm not sure if it was intentional or not, but I love that the Hebrew characters they picked look visually similar to NaN.


Read the test.py then ;)


Another one sort of related is hachoir, and specifically the hachoir-metadata script: https://github.com/vstinner/hachoir


> Still getting errors? Chain fuckit calls. This module is like violence: if it doesn't work, you just need more of it.

From the README. Jokes aside it seems like something I could actually have a use for.


Didn't know there was a python version, but as the README says, this is based on the classic fuckitjs: https://github.com/mattdiamond/fuckitjs


We built a similar tool, utilizing a CNN. It works on structured (and unstructured) data and provides additional info.

https://github.com/capitalone/DataProfiler

The cool part is that you can “extend” the internal named-entity recognition model by refitting it with new data.

Out of the box, the DataProfiler detects something like 18 entities, including most of the PII data.


Cool, but it seems like 80% of the results in your example demos are Youtube video IDs.


I find it kind of funny that they would choose to show those as demos when it's obvious that most of them really aren't YouTube video IDs. "Accept-Lang", for example, is pretty obviously not a video ID, even though it matches the [A-Za-z0-9_-]{11} pattern and technically could be a valid ID.

On the other hand, I don't know how you would actually verify whether an 11-character string is or isn't a YouTube ID (short of querying YouTube itself), so I suppose it's nice that potential IDs are shown. It just seems they have a very high chance of being false positives.
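To make the false-positive problem concrete, here is a minimal sketch using only the pattern quoted above. The function name is made up for illustration, and this is not a claim about how PyWhat itself implements the check.

```python
import re

# Pattern quoted in the comment above: any 11 characters from the
# URL-safe base64 alphabet.
CANDIDATE_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def looks_like_video_id(text: str) -> bool:
    """Return True when the whole string fits the 11-character pattern."""
    return CANDIDATE_ID.fullmatch(text) is not None


# A header fragment passes just as easily as a real-looking ID.
print(looks_like_video_id("Accept-Lang"))      # True (false positive)
print(looks_like_video_id("vUxrtqY84AM"))      # True (ID from the youtu.be link upthread)
print(looks_like_video_id("Accept-Language"))  # False (15 characters)
```

The pattern only constrains length and alphabet, so any 11-character header fragment made of letters, digits, "_" or "-" will match.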


You can reduce false positives by only flagging 11-character strings that look like base64, e.g. above a certain amount of entropy and with a plausible uppercase/lowercase/digit distribution. You might risk false negatives, but different flags for different levels of sensitivity could help with that.
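A rough sketch of that idea, with the caveat that the thresholds (`min_entropy`, `min_classes`) and function names are assumptions picked for illustration, not tuned values. Entropy alone is a weak signal on 11 characters, so the character-class check does most of the work here.

```python
import math
import re
from collections import Counter

CANDIDATE_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def shannon_entropy(text: str) -> float:
    """Bits per character, based on character frequencies in the string."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def plausible_video_id(text: str, min_entropy: float = 3.0, min_classes: int = 3) -> bool:
    """Heuristic: right shape, enough character classes, enough entropy.

    min_entropy and min_classes are illustrative knobs; lowering them
    trades false negatives for false positives.
    """
    if CANDIDATE_ID.fullmatch(text) is None:
        return False
    classes = sum([
        any(c.islower() for c in text),
        any(c.isupper() for c in text),
        any(c.isdigit() for c in text),
    ])
    return classes >= min_classes and shannon_entropy(text) >= min_entropy


print(plausible_video_id("vUxrtqY84AM"))                 # True
print(plausible_video_id("Accept-Lang"))                 # False: no digits
print(plausible_video_id("Accept-Lang", min_classes=2))  # True: looser setting lets it through
```

A real ID that happens to contain no digits would be rejected at the default setting, which is the false-negative risk mentioned above.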


Author here, that's because I made the gifs like weeks before the actual program was made and I am too lazy to spend an hour making gifs again


At least from the config I see that the rarity for it is set pretty low (0.2), so you can filter out the low rarity stuff. I would probably run it by default with like --rarity 0.5 or something.
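In spirit, the filter would be something like the sketch below. The record layout and the second entry's rarity are invented for illustration and are not PyWhat's actual data model; only the 0.2 comes from the config mentioned above.

```python
from typing import Dict, List


def filter_by_rarity(matches: List[Dict], min_rarity: float) -> List[Dict]:
    """Keep only matches whose pattern rarity is at or above the threshold."""
    return [m for m in matches if m["rarity"] >= min_rarity]


# Hypothetical match records, not real PyWhat output.
matches = [
    {"text": "Accept-Lang", "name": "YouTube Video ID", "rarity": 0.2},
    {"text": "example-match", "name": "Hypothetical rarer pattern", "rarity": 0.7},
]

for match in filter_by_rarity(matches, min_rarity=0.5):
    print(match["name"])  # only the hypothetical rarer pattern survives
```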


Why would I need this when I already have a full Tome of Identify with 50 charges?


PyWhat only uses one inventory slot vs. 2 for Tome. That's one extra SoJ!


Tome of identify only holds 20 charges


I'm pretty sure he's playing the Project Diablo II mod.


Why are these screenshots animated? The command is still visible in the final frame, and the final frame shows the output we're interested in, but not long enough to read and understand it.


that's a good point, i'll take it into consideration thanks :)


Somewhat odd result for s3.amazonaws.com:

    ~> python3 -m pywhat s3.amazonaws.com
    
    Possible Identification
    ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
    ┃ Matched Text     ┃ Identified as        ┃ Description ┃
    ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
    │ s3.amazonaws.com │ JSON Web Token (JWT) │ None        │
    └──────────────────┴──────────────────────┴─────────────┘


I guess it's matching on 3 blobs separated by dots (à la JWT). It should probably check length and then verify the JWT.
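A sketch of what a stricter structural check could look like, assuming the usual JWT layout of base64url-encoded JSON header and payload plus a signature segment. The function names are made up, and this does not verify the signature (that needs the key); it only rejects strings that cannot be a JWT at all.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode unpadded base64url by restoring the padding that JWTs strip."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def looks_like_jwt(token: str) -> bool:
    """Structural check only: three segments, JSON header with an "alg" field."""
    parts = token.split(".")
    if len(parts) != 3 or not all(parts[:2]):
        return False
    try:
        header = json.loads(b64url_decode(parts[0]))
        payload = json.loads(b64url_decode(parts[1]))
    except ValueError:
        # Covers bad base64, invalid UTF-8, and invalid JSON.
        return False
    return isinstance(header, dict) and "alg" in header and isinstance(payload, dict)


print(looks_like_jwt("s3.amazonaws.com"))  # False
print(looks_like_jwt("1.x."))              # False
```

The third segment is allowed to be empty because unsecured tokens with "alg": "none" have no signature.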


Something like that, `python3 -m pywhat 1.x.` gives the same result


I'm admittedly not impressed by the pcap processing.

It identifies a bunch of fragments of HTTP headers as "YouTube Video ID".

Meanwhile, I can get the same info and more by running

    $ strings FollowTheLeader.pcap
    *]?>
    GET / HTTP/1.1
    Host: 10.0.2.5
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-US,en;q=0.5
    Accept-Encoding: gzip, deflate
    Connection: keep-alive
    Upgrade-Insecure-Requests: 1
    Pragma: no-cache
    Cache-Control: no-cache
    HTTP/1.0 200 OK
    Server: SimpleHTTP/0.6 Python/3.7.3rc1
    Date: Sun, 14 Jul 2019 02:42:13 GMT
    Content-type: text/html
    Content-Length: 105
    Last-Modified: Sun, 14 Jul 2019 02:41:10 GMT
    <h1>My Flag Web Page</h1>
    <p>Hi there! Have a flag!</p>
    <p>Here is your flag: ctfa{terrific_traffic}</p>
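For anyone who wants the same thing from Python, here is a rough approximation of what `strings` does (runs of at least 4 printable ASCII characters), plus a search for the flag. The `ctfa{...}` pattern is an assumption based on the one flag shown above, and the sample bytes are made up rather than read from the real pcap.

```python
import re
from typing import List

# Flag shape assumed from the single flag in the output above.
FLAG = re.compile(r"ctfa\{[^}]*\}")


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII (plus tab) at least min_length long."""
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]


def find_flags(data: bytes) -> List[str]:
    """Search every extracted string for something shaped like a flag."""
    return [flag for text in extract_strings(data) for flag in FLAG.findall(text)]


# Made-up bytes standing in for a capture file's contents.
sample = b"\x00\x01GET / HTTP/1.1\r\n\x00ab\x00<p>Here is your flag: ctfa{terrific_traffic}</p>\xff"
print(extract_strings(sample))
print(find_flags(sample))  # ['ctfa{terrific_traffic}']
```

On a real capture you would pass the file's bytes instead of `sample`, e.g. the result of `open(path, "rb").read()`.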


At first I thought this was going to be like Google Lens. It's instead a way to probabilistically identify things in strings. I have wished for this to exist, and made my own dumbed-down version of it before. This could be very useful for less fragile screen scraping.


Good program! I think it can be useful in OSINT, and many more things!


haha yup!! I love OSINT :)


Has anyone tried using this on the GIMBAL/GOFAST UAP videos?


I am confused and amazed at the same time.

What is this sorcery?


There really is a Python module for everything.


Bee is a really tremendous and generous developer. I use a few of their other projects near-daily (Rustscan especially has changed my life.) Definitely one of those open source devs you follow just to see whatever they come up with next.


Hey! Bee here :))) Thank you so much for the kind words!!! I hope to make things that help people in their lives :)



