Claude Plays Pokémon

dang · 2025-02-26T20:51:44 1740603104

Related ongoing thread:

Show HN: LLM plays Pokémon (open sourced) - https://news.ycombinator.com/item?id=43187231

Philpax · 2025-02-25T16:50:12 1740502212

This is truly tremendous to watch. Eleven years from TPP, and we're watching the current best-in-class AI try its best at the same. Who'll get there first, the historical gestalt of Twitch users or the just-shy-of-10^26 FLOPS [0] AI model?

Now here's a concept for anyone with more money than sense: ClaudePlaysTwitchPlaysPokemon, where it's TPP but every participant is Claude. Would hivemind AI consensus perform better than a single AI? Anthropic's certainly looking into it! [1]

[0]: https://www.oneusefulthing.org/p/a-new-generation-of-ais-cla...

[1]: https://www.anthropic.com/news/visible-extended-thinking

IX-103 · 2025-02-27T14:24:58 1740666298

A few years ago there was another AI that tried to beat Pokemon. It wasn't an LLM. I think it was an LSTM trained with reinforcement learning. It got stuck in Mt Moon.

Right now, Claude has been stuck in Mt Moon for nearly a day. It keeps forgetting where it has been. It also almost always runs from battles instead of changing Pokemon or fighting.

At one point it got stuck in a Pokemon center when it mistook the character's red hat for the red carpet around the exit. It kept pressing down and wondering why it wasn't working. It only broke out of that when it mistakenly concluded it had successfully exited the Pokemon center. Then it wandered around a bit and only realized it was still in the Pokemon center after talking to Nurse Joy.

Philpax · 2025-02-27T15:56:03 1740671763

You're thinking of https://www.youtube.com/watch?v=DcYLT37ImBY and https://github.com/PWhiddy/PokemonRedExperiments.

> It also almost always runs from battles instead of changing Pokemon or fighting.

I believe this is because all of its Pokemon are on the verge of fainting, so it's trying to conserve them while it tries to find its way out.

> It keeps forgetting where it has been.

I'm wondering if this could be solved with a better harness; on one hand, that hurts the elegance of having one model dedicated to playing the game, but their existing harness is already cheating a little (they have a second LLM for verification). They're frequently compacting what's in context, which means its visual memory is quite poor - that could potentially be a point of improvement?
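To make the harness idea concrete, here is a minimal sketch of memory kept outside the model's context and re-injected after each compaction. It is purely illustrative and not how Anthropic's harness works; `VisitedMap`, `compact`, the `map_id` strings, and the tile coordinates are all hypothetical names, and it assumes the harness can read the player's position from the emulator.

```python
from collections import Counter


class VisitedMap:
    """Harness-side record of visited tiles, stored outside the model's context."""

    def __init__(self):
        # Keys are (map_id, x, y); values are visit counts.
        self.visits = Counter()

    def record(self, map_id, x, y):
        # Assumption: the harness can read the position from emulator state.
        self.visits[(map_id, x, y)] += 1

    def summary(self, map_id, top=3):
        # Build a short text summary the model can be given after compaction.
        tiles = [(pos, n) for pos, n in self.visits.items() if pos[0] == map_id]
        if not tiles:
            return f"No recorded visits on {map_id}."
        # Most-visited tiles first, ties broken by coordinates.
        tiles.sort(key=lambda item: (-item[1], item[0]))
        most = ", ".join(f"({x},{y}) x{n}" for (_, x, y), n in tiles[:top])
        return f"{map_id}: {len(tiles)} distinct tiles visited; most revisited: {most}"


def compact(history, memory, map_id, keep_last=2):
    """Drop old turns, but keep the visited-tile summary at the front."""
    return [memory.summary(map_id)] + history[-keep_last:]
```

The trade-off is the one mentioned above: the more state the harness tracks on the model's behalf, the less the run says about the model on its own.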

Y_Y · 2025-02-27T00:42:57 1740616977

Or the converse: feed all of Twitch chat to Claude and see if it can output the correct button presses.

unification_fan · 2025-02-27T05:08:35 1740632915

You'd have to feed it all of Twitch chat correlated to whatever frame was being streamed at the time and adjusted for network jitter and buffering.

Good luck
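As a rough illustration of the correlation step, the sketch below maps a chat message to the frame its author was probably looking at. It assumes a single fixed stream delay, which is exactly the simplification that jitter and buffering break in practice; `frame_for_message` and `latency_s` are hypothetical names, not part of any Twitch API.

```python
from bisect import bisect_right


def frame_for_message(frame_times, message_time, latency_s):
    """Return the index of the frame the viewer most likely saw.

    frame_times: sorted capture timestamps (in seconds) of the streamed frames.
    message_time: time the chat message arrived.
    latency_s: assumed delay between capture and the viewer seeing the frame.
    """
    # Shift the message back by the assumed delay.
    seen_at = message_time - latency_s
    # Pick the last frame captured at or before that moment.
    idx = bisect_right(frame_times, seen_at) - 1
    # Clamp to the first frame if the message predates the estimate.
    return max(idx, 0)
```

A per-viewer delay estimate would replace the single `latency_s` value, and that is the hard part.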

_--__--__ · 2025-02-26T21:30:24 1740605424

This is neat but watching a reasoning model that stops to consider "I have read half of a dialogue block, time to press A to get the rest of the text" gets old really quick. I think I'd rather watch a model try to play pokemon against human opponents on a simulator like pokemon showdown (which I understand is a bit further in an IP rights grey area than emulating a 30 year old game). In that case you would get to see how it handles unknown information and updates its reasoning based on the success/failure of its predictions.

lcnPylGDnU4H9OF · 2025-02-26T22:09:16 1740607756

> play pokemon against human opponents on a simulator like pokemon showdown

That's precisely the pet project I'd take on if/when I take the time to make a deep learning agent. There's a bot that plays one of the ladders already, but it's just a decision tree and the best players know how to predict its moves. It's around 1500 Elo on a ladder where the best players are 1800+. Still not bad, to be fair; it would probably beat me.

The bot has a pre-selected team, which I believe always starts with the same mon. I'd be more interested in an agent that fully played the game, start-to-finish, including making a team based on play data and selecting a starter based on the current opponent's team.

northern-lights · 2025-02-26T22:14:35 1740608075

A model doesn't need to play on visual simulators, it can very well do that on IRC (like the good old days of RS/GSBots), to show how it fares against humans.

One of the biggest challenges this Claude version faces is reading the visual data accurately. It was stuck in Viridian Forest and Pokemarts for a while because overworld objects like trees and paths kept confusing it.

alexchantavy · 2025-02-27T00:49:31 1740617371

Haha yeah this is cool but the days of watching Twitch Plays Pokemon or RNG Plays Pokemon or things like that were much more entertaining

unification_fan · 2025-02-27T05:09:47 1740632987

> which I understand is a bit further in an IP rights grey area than emulating a 30 year old game

But Nintendo will never take down anything that is related to Showdown because it would highlight their massive hypocrisy!

It would set a precedent. People would go: "wait, but why did they never take down Showdown itself? Could it be that it's because they actually benefit from its existence? Then why did they take down X/Y/Z? Oh! It's because copyright law only applies when you want it to! It's all arbitrary and made up! You just need to be friends with the right people in the VGC and your pet project will be immune from all legal backlash!"

Or something.

Seriously I hate it so fucking much that Nintendo does nothing about Showdown, which blatantly steals a ton of game assets, and then nukes some random guy's fan project that no one ever played.

Philpax · 2025-02-25T17:05:21 1740503121

It's run by Anthropic! https://x.com/AnthropicAI/status/1894419011569344978

falcor84 · 2025-02-25T23:47:16 1740527236

This administrative request for a reset is incredible - I can't help but feel that this is intended as the equivalent of a prompt injection for the person running it. Time to rewatch Ex Machina.

https://x.com/AnthropicAI/status/1894419017756029427?t=xDXk6...

tehsauce · 2025-02-27T00:01:39 1740614499

For anyone interested in watching lots of reinforcement learning agents playing Pokemon Red at once: we have a website that streams hundreds of concurrent games from multiple people’s training runs to a shared map in real time!

https://pwhiddy.github.io/pokerl-map-viz/

(works best on desktop)

sunaookami · 2025-02-25T20:41:26 1740516086

I like that it named the rival "Waclaude" :)

wanderer2323 · 2025-02-26T22:04:09 1740607449

What is the significance of this?

_--__--__ · 2025-02-26T22:07:34 1740607654

https://en.wikipedia.org/wiki/Waluigi_effect

adenta · 2025-02-26T22:37:37 1740609457

I think the “wa” prefix means “bad” in Japanese

Izkata · 2025-02-27T04:27:21 1740630441

There's a bit of a... pun(?) in there with its apparent origin as a name in Waluigi: The word is "waru" (noun) or "warui" (adjective), and with the "l" / "r" thing with Japanese pronunciation, "warui" and "luigi" combine really well.

Granted Wario was first in that franchise.

sunaookami · 2025-02-27T08:23:54 1740644634

Correct. Another pun is that "ruigi" (類義) means "similar".

skoll43 · 2025-02-26T21:32:30 1740605550

not beating the copyright allegations

meltyness · 2025-03-01T14:48:39 1740840519

I can't look at the current state of this without wondering if it's tokenizer dyslexia. I wonder if AI performance growth has been borrowed from overfitting, from pruning the tokenizer of invalid sequences, and from leaking the entire corpus, a cardinal sin when making valid predictions.

TheAceOfHearts · 2025-02-27T05:37:41 1740634661

Watching it moment to moment is pretty boring, but it might be interesting if someone puts together highlights of the notable events. The screenshot where Claude asks for the game to restart is absolutely charming.

j_timberlake · 2025-02-27T17:53:24 1740678804

This would be a really cool category of speed-running. "How fast can a model beat a game that it's never played before?"

First get the model to beat a game, then work on better decision-making, then try to speed up the decision-making. Then repeat when better models come out.