https://github.com/datamade/usaddress - "usaddress is a Python library for parsing unstructured address strings into address components, using advanced NLP methods."
https://github.com/datamade/probablepeople - "probablepeople is a python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods."
A note on the usaddress library, since I was surprised when it failed spectacularly for me: the 'us' in the name appears to refer to the US, not to 'unstructured'. The readme doesn't mention this, though there is a small US flag emoji in the GitHub about string.
Nice! In the same spirit, here’s an interesting talk on using Gen.jl (a probabilistic programming library/framework) for cleaning messy data in tables: https://youtu.be/vUxrtqY84AM
I have used and benefited tremendously from both of these libraries. While the methods are sound, the training data they used is not that comprehensive, so you will probably want to apply some heuristic cleanup before and after processing. Or, if your organization has a lot of time and money, add additional training data.
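A minimal sketch of that pre/post cleanup around usaddress (the normalization rules here are illustrative assumptions, not anything the library prescribes):

    import re
    import usaddress

    def clean(raw: str) -> str:
        # Illustrative pre-processing: collapse whitespace,
        # drop periods ("St." -> "St"), space out commas.
        s = " ".join(raw.split())
        s = s.replace(".", "")
        s = re.sub(r",(?=\S)", ", ", s)
        return s

    raw = "  123  N.   Main   St.,Springfield IL 62704 "
    try:
        tagged, address_type = usaddress.tag(clean(raw))
        print(address_type, dict(tagged))
    except usaddress.RepeatedLabelError:
        # Post-processing hook: fall back to token-level labels and
        # resolve repeated labels with your own heuristics.
        print(usaddress.parse(clean(raw)))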
The web devs tell me that fuckit's versioning scheme is confusing, and that I should use "Semitic Versioning" instead. So starting with fuckit version ה.ג.א, package versions will use Hebrew Numerals.
I find it kind of funny that they would choose to show those as demos when it's obvious that most of them really aren't YouTube video IDs. "Accept-Lang", for instance, is pretty obviously not a video ID, even though it matches the [A-Za-z0-9_-]{11} pattern and technically could be a valid one.

On the other hand, I don't know how you would actually verify whether an 11-character string is or isn't a YouTube ID (short of querying YouTube itself), so I suppose it's nice that potential IDs are shown; it just seems they have a very high chance of being false positives.
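For illustration, the quoted pattern happily accepts header-like words; only the first string below is a real video ID:

    import re

    # The 11-character pattern mentioned above.
    YOUTUBE_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")

    for candidate in ["dQw4w9WgXcQ", "Accept-Lang", "Content-Len"]:
        print(candidate, bool(YOUTUBE_ID.match(candidate)))
    # All three match the pattern, but only the first is an actual video.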
You can reduce false positives by trying to identify base64-looking strings that are exactly 11 characters long: require a certain amount of entropy, a plausible uppercase/lowercase/digit distribution, and so on. You might risk false negatives, but different flags for different sensitivity levels could help with that.
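A rough sketch of such a heuristic; the thresholds are made-up assumptions you'd have to tune:

    import math
    import string
    from collections import Counter

    def shannon_entropy(s: str) -> float:
        # Shannon entropy in bits per character.
        counts = Counter(s)
        n = len(s)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def looks_like_youtube_id(s: str,
                              min_entropy: float = 3.0,
                              min_classes: int = 3) -> bool:
        # 11 chars from the ID alphabet, reasonably random-looking.
        # Loosening/tightening the thresholds trades false positives
        # against false negatives (e.g. expose them as sensitivity flags).
        if len(s) != 11:
            return False
        alphabet = string.ascii_letters + string.digits + "-_"
        if any(ch not in alphabet for ch in s):
            return False
        classes = sum([
            any(ch.islower() for ch in s),
            any(ch.isupper() for ch in s),
            any(ch.isdigit() for ch in s),
        ])
        return shannon_entropy(s) >= min_entropy and classes >= min_classes

    print(looks_like_youtube_id("dQw4w9WgXcQ"))  # True: mixed case + digits
    print(looks_like_youtube_id("Accept-Lang"))  # False: no digits, word-like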
At least from the config I see that the rarity for it is set pretty low (0.2), so you can filter out the low-rarity stuff. I would probably run it by default with something like --rarity 0.5.
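Something like this post-filter is what I mean; the match records here are hypothetical, not necessarily the tool's actual output schema:

    # Hypothetical match records with a rarity score, for illustration only.
    matches = [
        {"name": "YouTube Video ID", "matched": "Accept-Lang", "rarity": 0.2},
        {"name": "Email Address", "matched": "me@example.org", "rarity": 0.9},
    ]

    MIN_RARITY = 0.5  # the same idea as running with --rarity 0.5

    for m in matches:
        if m["rarity"] >= MIN_RARITY:
            print(f"{m['name']}: {m['matched']}")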
Why are these screenshots animated? The command is still visible in the final frame, and the final frame shows the output we're interested in, but it isn't displayed long enough to read and understand before the animation loops.
At first I thought this was going to be like Google Lens. Instead, it's a way to probabilistically identify things in strings. I have wished for this to exist, and I've made my own dumbed-down version of it before. This could be very useful for less fragile screen scraping.
Bee is a really tremendous and generous developer. I use a few of their other projects near-daily (RustScan especially has changed my life). Definitely one of those open source devs you follow just to see whatever they come up with next.