
Unfortunately, just this week someone fine-tuned the Mistral-7B LLM to play DOOM :P

https://news.ycombinator.com/item?id=39813174




For very modest definitions of playing. Perhaps it'd be more impressive if they recorded a demo file and let that play back without the realtime overhead? Even so, it can only move forward and back, turn, and fire. And it only knows to face away from a wall it's collided with. This is so far below even basic Doom bots that I'd be afraid to call it playing.

The intermediate ASCII representation also seems unnecessary and very limiting. But perhaps that's to keep it near realtime; it looks like about 1 FPS?

And why run on a Mac? Why not a beefy PC with a GPU that can do the calculations faster?

Still, it does seem like a fun challenge. Maybe with further tuning or training it can level up.


Reminded me of "Growing Living Rat Neurons To Play... DOOM?"

https://www.youtube.com/watch?v=bEXefdbQDjw


Are there any models fine-tuned to play an open-source game that is non-GPL, so it could be deployed to the App Store for interesting bot-play ideas?


How could this possibly be in the training set?


It’s not. The fine-tuning taught the LLM to give single-character responses (move/fire keyboard controls) in response to a sequence of ASCII-art-ized frames of the game being played.
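
Roughly, each training example would pair a few recent frames with the key that was pressed. Something like this, as a sketch (the prompt layout here is my guess, not necessarily the project's actual format):

    # Hypothetical shape of one fine-tuning example; the real
    # project's prompt layout may differ.
    example = {
        "prompt": (
            "<frame t-2 as ASCII>\n"
            "<frame t-1 as ASCII>\n"
            "<frame t as ASCII>\n"
            "Action:"
        ),
        "completion": "w",  # a single character: move/turn/fire key
    }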


Is it actually ASCII art or just a textual encoding? The art representation is nice for looking under the hood and seeing something pretty, but it seems very far from an optimal way to textually encode Doom for a language model to process. Especially since there is no pitching the camera, you can encode all of the information you need to represent a frame in a single line of ASCII. If they are actually using an ASCII art representation, I bet they would get way better performance encoding the frame as a single line of text.
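
Roughly what I mean, as a sketch in Python (the ColumnHit type and the class letters are invented; you'd pull the real per-column hit data out of the engine):

    # Sketch: one character per screen column -> one line per frame.
    from dataclasses import dataclass

    @dataclass
    class ColumnHit:
        kind: str        # "wall", "enemy", "item", "door"
        distance: float  # map units to whatever the ray hit

    KIND_TO_CHAR = {"wall": "W", "enemy": "E", "item": "I", "door": "D"}

    def encode_frame(columns: list[ColumnHit]) -> str:
        """One character per screen column, so one line per frame."""
        return "".join(KIND_TO_CHAR.get(c.kind, "?") for c in columns)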


I never realised you could encode each column of Doom as a single character, but of course you can! I suppose the one thing missing would be distance, but if you get 8 bits per character you could reserve the upper bits to represent approximate distance.
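
Something like this, say (a sketch; the class codes and distance buckets are made up):

    # Low 3 bits = content class, upper 5 bits = quantized distance,
    # packed into one 8-bit value per column and mapped to a
    # printable character.
    KINDS = {"wall": 0, "enemy": 1, "item": 2, "door": 3}

    def encode_column(kind: str, distance: float, max_dist: float = 2048.0) -> str:
        bucket = min(int(distance / max_dist * 32), 31)  # 5-bit distance
        byte = (bucket << 3) | KINDS[kind]               # 5 + 3 = 8 bits
        return chr(0x100 + byte)  # offset into a printable Unicode block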

That's weirdly inspiring! What other games can I make where the visuals are conceptually no more than a line of characters, but which can get macroexpanded into immersive graphics?


Another point to note is that you aren't stuck with a single character to encode a column of Doom as text. You could also do something like a letter to represent the content, followed by a number to represent the distance.

I think the only weird part about that is that certain letter-number pairs may be a single token with some other semantics in the model, while other letter-number pairs would be a pair of tokens. That could impact the performance of the model (but probably not by a huge amount).
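
An easy way to check would be to run a candidate encoding through the actual tokenizer. A sketch (the class letters and distance buckets are made up; the Mistral repo on Hugging Face may be gated, so substitute whatever tokenizer your model uses):

    # Letter = content class, digit = distance bucket (0-9); then
    # inspect where the tokenizer puts token boundaries.
    from transformers import AutoTokenizer  # pip install transformers

    def encode_column(kind_char: str, distance: float, max_dist: float = 2048.0) -> str:
        bucket = min(int(distance / max_dist * 10), 9)
        return f"{kind_char}{bucket}"

    frame = "".join(encode_column(k, d)
                    for k, d in [("W", 300.0), ("E", 150.0), ("W", 900.0)])
    print(frame)  # "W1E0W4"

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    print(tok.tokenize(frame))  # see which letter-digit pairs stay single tokens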


I suppose the save states of a game are a compressed representation of the world to a degree.


If you just click through the links you’ll see the actual input to the LLM: https://twitter.com/SammieAtman/status/1772075251297550457

Nothing you are saying is technically incorrect. But, optimal performance was not the goal. The goal was to see if this crazy stupid concept would actually work. And, it does!


Ah, I think I clicked the actual post link and saw nothing, and backed out. Thanks for the direct link to the video.

And yeah, I totally get not aiming for optimal performance. I think it would be interesting to see how a language model could perform with a format that's less catered to visual inspection, though. Like, textually there is little association between columns; it's just a string of characters, and some of them happen to be newline characters. A more densely packed encoding would play more into the logic and reasoning encoded into the model, rather than just trying to parse out ASCII art.


That’s so dang cool



