This is truly tremendous to watch. Eleven years from TPP, and we're watching the current best-in-class AI try its best at the same. Who'll get there first, the historical gestalt of Twitch users or the just-shy-of-10^26 FLOP [0] AI model?
Now here's a concept for anyone with more money than sense: ClaudePlaysTwitchPlaysPokemon, where it's TPP but every participant is Claude. Would hivemind AI consensus perform better than a single AI? Anthropic's certainly looking into it! [1]
A few years ago there was another AI that tried to beat Pokemon. It wasn't an LLM. I think it was an LSTM trained with reinforcement learning. It got stuck in Mt Moon.
Right now, Claude has been stuck in Mt Moon for nearly a day. It keeps forgetting where it has been. It also almost always runs from battles instead of changing Pokemon or fighting.
At one point it got stuck in a Pokemon center when it mistook the character's red hat for the red carpet around the exit. It kept pressing down and wondering why it wasn't working. It only broke out of that when it mistakenly concluded it had successfully exited the Pokemon center. Then it wandered around a bit and only realized it was still in the Pokemon center after talking to Nurse Joy.
> It also almost always runs from battles instead of changing Pokemon or fighting.
I believe this is because all of its Pokemon are on the verge of fainting, so it's trying to conserve them while it tries to find its way out.
> It keeps forgetting where it has been.
I'm wondering if this could be solved with a better harness; on one hand, that hurts the elegance of having one model dedicated to playing the game, but their existing harness is already cheating a little (they have a second LLM for verification). They're frequently compacting what's in context, which means its visual memory is quite poor - that could potentially be a point of improvement?
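For what it's worth, here is a rough sketch of what "a better harness" could mean for the forgetting problem. It is not how the actual stream's harness works. It assumes the harness can read the player's map name and tile coordinates from the emulator (an assumption; nothing in this thread confirms that), and `VisitedTiles`, `record`, and `summary` are made-up names. The idea is to keep a visit log outside the model's context and re-inject a one-line summary after each compaction.

```python
from collections import Counter


class VisitedTiles:
    """Hypothetical harness-side memory that survives context compaction."""

    def __init__(self) -> None:
        # (map_name, x, y) -> number of times the player stood on that tile
        self.counts: Counter = Counter()

    def record(self, map_name: str, x: int, y: int) -> None:
        """Log one step; assumes the harness can read these from the emulator."""
        self.counts[(map_name, x, y)] += 1

    def summary(self, map_name: str, top_n: int = 3) -> str:
        """Render a short note to prepend to the prompt after compaction."""
        tiles = {(x, y): n for (m, x, y), n in self.counts.items() if m == map_name}
        if not tiles:
            return f"{map_name}: no tiles visited yet."
        # Most-revisited tiles first; ties broken by coordinate for stable output.
        most = sorted(tiles.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        listed = ", ".join(f"({x},{y}) x{n}" for (x, y), n in most)
        return f"{map_name}: {len(tiles)} distinct tiles visited; most revisited: {listed}"


memory = VisitedTiles()
for x, y in [(3, 4), (3, 5), (3, 4), (4, 4), (3, 4)]:
    memory.record("Mt. Moon 1F", x, y)

print(memory.summary("Mt. Moon 1F"))
# Mt. Moon 1F: 3 distinct tiles visited; most revisited: (3,4) x3, (3,5) x1, (4,4) x1
```

A tile with a high revisit count is a cheap hint that the model is looping, which is the kind of thing that gets lost when the context is summarized. Whether this counts as "cheating" is the same elegance trade-off as the second verification LLM.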
This is neat but watching a reasoning model that stops to consider "I have read half of a dialogue block, time to press A to get the rest of the text" gets old really quick. I think I'd rather watch a model try to play Pokemon against human opponents on a simulator like Pokemon Showdown (which I understand is a bit further in an IP rights grey area than emulating a 30 year old game). In that case you would get to see how it handles unknown information and updates its reasoning based on the success/failure of its predictions.
> play Pokemon against human opponents on a simulator like Pokemon Showdown
That's precisely the pet project I'd take on if/when I bother to take the time to make a deep learning agent. There's a bot that plays one of the ladders already, but it's just a decision tree and the best players know how to predict its moves. It's around 1500 Elo on a ladder where the best players are 1800+. Still not bad, to be fair; it would probably beat me.
The bot has a pre-selected team, which I believe always starts with the same mon. I'd be more interested in an agent that fully played the game, start-to-finish, including making a team based on play data and selecting a starter based on the current opponent's team.
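To put the 1500 vs 1800 gap above in perspective, here is a quick calculation using the textbook Elo expected-score formula. This assumes the ladder rating behaves like standard Elo with the usual 400-point scale, which may not exactly match how Showdown computes its ladder; `elo_expected_score` is just an illustrative helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of player A against player B
    under the standard Elo model with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


bot_rating = 1500  # figure quoted above for the decision-tree bot
top_rating = 1800  # figure quoted above for the best players

p = elo_expected_score(bot_rating, top_rating)
print(f"{p:.3f}")  # 0.151
```

So under that assumption, the bot would be expected to take roughly one game in seven off a top player, before accounting for the fact that good players can predict its moves.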
A model doesn't need to play on visual simulators; it can just as well do that over IRC (like the good old days of RS/GSBots) to show how it fares against humans.
One of the biggest challenges this Claude version faces is reading the visual data accurately. It was stuck in the Viridian Forest and Pokemarts for a while because overworld objects like trees and paths kept confusing it.
> which I understand is a bit further in an IP rights grey area than emulating a 30 year old game
But Nintendo will never take down anything that is related to Showdown because it would highlight their massive hypocrisy!
It would set a precedent. People would go: "wait, but why did they never take down Showdown itself? Could it be that it's because they actually benefit from its existence? Then why did they take down X/Y/Z? Oh! It's because copyright law only applies when you want it to! It's all arbitrary and made up! You just need to be friends with the right people in the VGC and your pet project will be immune from all legal backlash!"
Or something.
Seriously I hate it so fucking much that Nintendo does nothing about Showdown, which blatantly steals a ton of game assets, and then nukes some random guy's fan project that no one ever played.
This administrative request for a reset is incredible - I can't help but feel that this is intended as the equivalent of a prompt injection for the person running it. Time to rewatch Ex Machina.
For anyone interested in watching lots of reinforcement learning agents playing Pokemon Red at once, we have a website which streams hundreds of concurrent games from multiple people's training runs to a shared map in real time!
There's a bit of a... pun(?) in there with its apparent origin as a name in Waluigi: The word is "waru" (noun) or "warui" (adjective), and with the "l" / "r" thing with Japanese pronunciation, "warui" and "luigi" combine really well.
I can't look at the current state of this without wondering if it's tokenizer dyslexia. I wonder if AI performance growth has been borrowed from overfitting, from pruning the tokenizer of invalid sequences, and from leaking the entire corpus, a cardinal sin when trying to make valid predictions.
Watching the moment to moment is pretty boring, but it might be interesting if someone puts together highlights of interesting events and moments. The screenshot where Claude asks for the game to restart is absolutely charming.
This would be a really cool category of speed-running. "How fast can a model beat a game that it's never played before?"
First get the model to beat a game, then work on better decision-making, then try to speed up the decision-making. Then repeat when better models come out.
Show HN: LLM plays Pokémon (open sourced) - https://news.ycombinator.com/item?id=43187231