
It’s not. The fine tuning taught the LLM how to give single-character responses (move/fire keyboard controls) in response to a sequence of ASCII-art-ized frames of the game being played.



Is it actually ASCII art or just a textual encoding? The art representation is nice for looking under the hood and seeing something pretty, but it feels very far from an optimal way to textually encode Doom for a language model to process. Especially since there is no pitching the camera, you can encode all of the information you need to represent a frame in a single line of ASCII. If they are actually using an ASCII art representation, I bet they would get way better performance encoding the frame as a single line of text.


I never realised you could encode each column of Doom as a single character, but of course you can! I suppose the one thing missing would be distance, but if you get 8 bits per character you could reserve the upper bits to represent approximate distance.
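A minimal sketch of that packing idea. Everything here is hypothetical and not from the linked project: the content classes, the 1024-unit max distance, and the choice to offset into a clean Unicode range (raw 8-bit values would include unprintable control characters) are all my own assumptions.

```python
# Hypothetical one-character-per-column encoding for a Doom-like frame.
# Lower 3 bits: content class. Upper 5 bits: distance quantized to 0..31.
# Offsetting by 0x100 keeps every value a single printable character.

CONTENT_CLASSES = {"empty": 0, "wall": 1, "door": 2, "item": 3, "enemy": 4}
INV_CLASSES = {v: k for k, v in CONTENT_CLASSES.items()}

def encode_column(content: str, distance: float, max_distance: float = 1024.0) -> str:
    """Pack one screen column (content class + quantized distance) into one char."""
    cls = CONTENT_CLASSES[content]                        # fits in 3 bits
    bucket = min(int(distance / max_distance * 31), 31)   # quantize to 0..31
    return chr(0x100 + ((bucket << 3) | cls))

def decode_column(ch: str) -> tuple[str, int]:
    """Recover (content class, distance bucket) from one encoded character."""
    byte = ord(ch) - 0x100
    return INV_CLASSES[byte & 0b111], byte >> 3
```

A 320-column frame then becomes a single 320-character line, instead of a multi-line ASCII picture.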

That's weirdly inspiring! What other games can I make where the visuals are conceptually no more than a line of characters, but which can get macroexpanded into immersive graphics?


Another point to note is that you aren't stuck with a single character to encode a column of Doom as text. You could also do something like a letter to represent the content, followed by a number to represent the distance.

I think the only weird part about that is that certain letter-number pairs may be a single token with some other semantics in the model, and other letter-number pairs would be a pair of tokens. I think that could impact the performance of the model (but probably not by a huge amount).


I suppose the save states of a game are a compressed representation of the world to a degree.


If you just click through the links you’ll see the actual input to the LLM https://twitter.com/SammieAtman/status/1772075251297550457

Nothing you are saying is technically incorrect. But, optimal performance was not the goal. The goal was to see if this crazy stupid concept would actually work. And, it does!


Ah, I think I clicked the actual post link and saw nothing, and backed out. Thanks for the direct link to the video.

And yeah, I totally get not aiming for optimal performance. I think it would be interesting to see how a language model could perform with a format that is less visually catered, though. Textually there is little association between columns; it's just a string of characters, and some of them happen to be newline characters. A more densely packed encoding would play more into the logic and reasoning encoded into the model, rather than just trying to parse out ASCII art.


That’s so dang cool



