I used ChatGPT to decode proprietary binary files from some industrial machinery. It was amazing how it could decipher stuff and find patterns. It first looked for ASCII characters, then for byte sequences acting as delimiters, then it started looking at which bytes could be a length field, which 4-byte groups could be floating-point coordinates, which endianness made more sense for the coordinates, etc. etc. Crazy stuff.
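The heuristics described above (ASCII runs, delimiter bytes, plausible floats in either endianness) are easy to sketch by hand too. This is just an illustrative toy, not the actual file format; the sample `blob` and the "plausible magnitude" range are made up:

```python
import struct

def ascii_runs(data, min_len=4):
    """Find runs of printable ASCII: a first hint at embedded strings."""
    runs, start = [], None
    for i, b in enumerate(data):
        if 0x20 <= b < 0x7F:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, data[start:i].decode('ascii')))
            start = None
    if start is not None and len(data) - start >= min_len:
        runs.append((start, data[start:].decode('ascii')))
    return runs

def candidate_floats(data, offset):
    """Unpack 4 bytes at offset both ways; keep 'plausible' magnitudes."""
    le = struct.unpack_from('<f', data, offset)[0]
    be = struct.unpack_from('>f', data, offset)[0]
    plausible = lambda x: x == x and 1e-3 < abs(x) < 1e6  # NaN check + range
    return {'little': le if plausible(le) else None,
            'big': be if plausible(be) else None}

# Toy record: a tag string, a delimiter byte, two little-endian coordinates
blob = b'PART01\x00\xFF' + struct.pack('<2f', 12.5, -3.25) + b'\xFF'
print(ascii_runs(blob))           # [(0, 'PART01')]
print(candidate_floats(blob, 8))  # {'little': 12.5, 'big': None}
```

The little/big comparison is the useful part: garbage endianness usually produces denormals or absurd magnitudes, so the "sensible" interpretation tends to stand out.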
That sounds amazing. Shame it's proprietary, I'd love to read that chat transcript. Do you just paste binary data in and ask it to decipher it? Or do you ask it leading questions? Or...?
This is cool, though it did make a mistake while converting a hex number to decimal (0x132004 = 1253380, not 1249284). Proofreading this can be a big pain. It can detect those patterns in a long string like nothing, yet it fails at basic conversion, which is really beyond me.
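For the record, that conversion is trivial to double-check, which is probably the right habit whenever a model does arithmetic for you:

```python
# 0x132004 = 1*16^5 + 3*16^4 + 2*16^3 + 4 = 1048576 + 196608 + 8192 + 4
print(int('132004', 16))  # 1253380
print(hex(1253380))       # 0x132004
```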
Yes, I tried it on this bin file and it didn't go as deep as stock GPT-4. It wrote some Python code to parse the file, but it was hard to have a long conversation with it about the data. It was always jumping into writing Python before the brainstorming was finished (could be a feature, not a bug) ;)
I'm looking to reverse engineer a file format in order to implement an editor for it (a proprietary file format, undocumented but AFAIK not encrypted). Would it be possible to use that approach for this purpose? Is there another free tool for it?
That’s a very generic question, hard to tell without extra details, but I find it useful for decoding hashes, or at least for giving clues on how to decode them.
I don't buy this. LLMs are basically just fancy text completion based on training data. "Binary data from a proprietary industrial machine" sounds like the furthest possible thing from what could have been in the training data. How can you possibly trust its output on something it's never seen before?
The only reason I say this is because I have tried. I asked an LLM to decode a variety of base64 strings, and every single time it said the decoded ASCII was "Hello, world!"
This doesn't come as a surprise to me. Unless it was trained on a dataset that included a mapping for every base64-encoded sequence, it's just going to pattern-complete on base64-looking strings of characters and assume they translate to "Hello, world!" from some programming tutorial it was trained on.
That's still kinda cool. Now I'm curious whether it can decode all the figlet fonts too. Size can be controlled with HTML, as some are easier for a human to read when smaller.
[Edit] - This might make one's eyes bleed, but I am curious if it can read this [1]. If you install figlet, type showfigfonts to see examples of all the installed fonts. More can be installed [2] in /usr/share/figlet/fonts/
That kind of decoding is a bit different, though. For one, the tokenization process makes character-level encodings difficult to handle (unless the model is trained on a lot of encoded/decoded pairs).
This would be more akin to asking ChatGPT to help build a black box parser for base64, not asking it to decode it itself.
GPT-4 can absolutely decode base64. Early jailbreaks involved base64-encoding a Python-based jailbreak to get it to output whatever you wanted; later, OpenAI added a patch that filters base64 outputs so they follow their rules.
Some of the input data was known, yes, because this software has a GUI and outputs a binary file based on user data (a PCB bill of materials) plus internal machine settings. So I knew there were some coordinates and ASCII data in there, and GPT helped find the delimiters, etc. Some things I was also able to figure out with Ghidra and lots of trial and error.
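Known plaintext like that helps a lot: if the GUI showed you a coordinate, you can search the file for its 4-byte float encoding in both endiannesses and locate the coordinate table directly. A rough sketch (the coordinate value and the `blob` layout here are invented for illustration):

```python
import struct

def find_float(data, value):
    """Locate a known float32 value in a blob, trying both endiannesses."""
    hits = []
    for fmt, name in (('<f', 'little'), ('>f', 'big')):
        needle = struct.pack(fmt, value)
        start = 0
        while (i := data.find(needle, start)) != -1:
            hits.append((i, name))
            start = i + 1
    return hits

# Fake "machine file" with one known pick-and-place X coordinate at offset 7
blob = b'\x00' * 7 + struct.pack('<f', 45.72) + b'\xFF' * 5
print(find_float(blob, 45.72))  # [(7, 'little')]
```

One caveat: exact byte matching only works if the value round-trips through float32 exactly as the GUI stored it; otherwise you'd scan every offset and compare with a tolerance instead.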