The super effectiveness of Pokémon embeddings using only raw JSON and images (minimaxir.com)
171 points by minimaxir 85 days ago | 28 comments



Very nice! This took me about 30 minutes to re-implement for Magic: The Gathering cards (with data from mtgjson.com), and then about 40 minutes or so to create the embeddings. It does rather well at finding similar cards for when you want more than a 4-of, or of course for Commander. That's quite useful for weirder effects where one doesn't have the common options memorized!
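Roughly, the pipeline is just "flatten each card's JSON to text, embed, compare". A sketch of that shape (assuming the AtomicCards.json dump from mtgjson.com and a generic sentence-transformers model, not necessarily what the article used):

    # Sketch: embed MTG cards from mtgjson.com's AtomicCards.json dump.
    import json
    from sentence_transformers import SentenceTransformer

    with open("AtomicCards.json", encoding="utf-8") as f:
        cards = json.load(f)["data"]          # {card name: [face dicts]}

    # One text blob per card: its name plus the raw JSON of its first face.
    texts = [name + " " + json.dumps(faces[0], sort_keys=True)
             for name, faces in cards.items()]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts, normalize_embeddings=True,
                              show_progress_bar=True)

Cosine similarity over that matrix (just a dot product, since the vectors are normalized) gives the "similar cards" lookup.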


I was thinking about redoing this with Magic cards too (I already have quite a lot of code for preprocessing that data), so it's good to know it works there too! :)


There seem to be a lot of properties that are numeric or boolean, e.g.

    "base_happiness": 50,
    "capture_rate": 190,
    "forms_switchable": false,
    "gender_rate": 4,
    "has_gender_differences": true,
    "hatch_counter": 10,
    "is_baby": false,
    "is_legendary": false,
    "is_mythical": false,
Why not treat each of those properties as an extra dimension, and have the embedding model handle only the remaining (non-numeric) fields?

Is it because:

A) It's easier to just embed everything, or

B) Treating those numeric fields as separate dimensions would mean their interactions wouldn't be considered (without PCA), or

C) Something else?


Because then you couldn't use a pretrained LLM to give you the embeddings. If you added these numerics as extra dimensions, you would need to train a new model that somehow learns the meaning of those extra dimensions based on some measure.


The embedding model outputs a vector, which is a list of floats. If we wrap the embedding model with a function that adds a few extra dimensions (one for each of these numeric variables, perhaps compressed into the range zero to one) then we would end up with vectors that have a few extra dimensions (e.g. 800 dimensions instead of 784 dimensions). Vector similarity should still just work, no?
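Something like this, concretely (a sketch, not from the article; the field maxima are guesses used only for 0-1 scaling):

    # Sketch: append 0-1 scaled numeric fields to the text embedding.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    NUMERIC_FIELDS = ["base_happiness", "capture_rate", "gender_rate", "hatch_counter"]
    MAXIMA = {"base_happiness": 255, "capture_rate": 255,
              "gender_rate": 8, "hatch_counter": 120}   # assumed upper bounds

    model = SentenceTransformer("all-MiniLM-L6-v2")     # 384-dim output

    def embed_with_numerics(record: dict, text: str) -> np.ndarray:
        text_vec = model.encode(text, normalize_embeddings=True)
        extra = np.array([record.get(f, 0) / MAXIMA[f] for f in NUMERIC_FIELDS])
        return np.concatenate([text_vec, extra])        # 384 + 4 dimensions

Cosine similarity still works on the longer vectors; the open question is how to weight the few extra dimensions against the 384 text ones, which is presumably part of what the parent means by needing to learn their meaning.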


I would be interested in how this might work with just looking for common words between the text fields of the JSON file, weighted by e.g. TF-IDF or BM25.

I wonder if you might get similar results. I'd also be interested in the comparative computational resources it takes. Encoding takes a lot of resources, but I imagine lookup would be a lot less resource-intensive (i.e. time and/or memory).
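A minimal TF-IDF version of that with scikit-learn might look like this (a sketch; the file name and JSON flattening are made up):

    # Sketch: lexical nearest-neighbour baseline over flattened JSON text.
    import json
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    with open("pokemon.json", encoding="utf-8") as f:   # hypothetical list of per-Pokemon dicts
        records = json.load(f)

    docs = [json.dumps(r, sort_keys=True) for r in records]
    tfidf = TfidfVectorizer(sublinear_tf=True, min_df=2)
    matrix = tfidf.fit_transform(docs)

    sims = cosine_similarity(matrix[0], matrix).ravel() # neighbours of record 0
    print(sims.argsort()[::-1][1:11])

For BM25, something like rank_bm25's BM25Okapi slots into the same place. Fitting is cheap compared to encoding with a transformer, and lookup in both cases is just a sparse or dense dot product.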


Almost everyone uses MiniLM-L6-v2.

You almost certainly don't want to use MiniLM-L6-v2.

MiniLM-L6-v2 is for symmetric search: i.e. documents similar to the query text.

MiniLM-L6-v3 is for asymmetric search: i.e. documents that would have answers to the query text.

This is also an amazing lesson in... something: sentence-transformers spells this out in their docs, over and over, except never this directly: i.e. there's a doc on how to make a proper search pipeline, and a doc on the correct model for each type of search, but not a doc saying "hey, use this one."

And yet, I'd wager there's $100+M invested in vector DB startups who would be surprised to hear it.
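To make that concrete with model names from the sentence-transformers pretrained-model docs (a sketch; the asymmetric model shown is the QA-tuned MiniLM variant):

    from sentence_transformers import SentenceTransformer, util

    # Symmetric search: query and documents are the same kind of text.
    symmetric = SentenceTransformer("all-MiniLM-L6-v2")

    # Asymmetric search: short query, longer passages that answer it.
    asymmetric = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

    passages = ["Bulbasaur is a dual-type Grass/Poison Pokemon introduced in Generation I.",
                "Charmander is a Fire-type Pokemon introduced in Generation I."]
    query = "which starter is grass type"

    hits = util.semantic_search(asymmetric.encode(query, convert_to_tensor=True),
                                asymmetric.encode(passages, convert_to_tensor=True),
                                top_k=1)
    print(hits)

Same API either way; the difference is entirely in which checkpoint you load, which is exactly the detail that's easy to miss.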


It would be nice if you spelled out in your post how you know this, then. Is it written somewhere? A relevant paper, for example?


> it has a doc on how to make a proper search pipeline, and a doc on the correct model for each type of search, but not a doc saying "hey use this"


> minimaxir uses Embeddings!

> It’s super effective!

> minimaxir obtains HN13


Nice article. I remember the original work. Can you elaborate on this one, Max?

> Even if the generative AI industry crashes


It's a note that embeddings R&D is orthogonal to whatever happens with generative AI even though both involve LLMs.

I'm not saying that generative AI will crash but if it's indeed at the top of the S-curve there could be issues, notwithstanding the cost and legal issues that are only increasing.


While there is no real definition of LLM I’m not sure I would say both involve LLMs. There is a trend towards using the hidden state of an LLM as an embedding but this is relatively recent, and overkill for most use-cases. Plenty of embedding models are not large, and it’s fairly trivial to train a small domain-specific embedding model that has incredible utility.
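For instance, with sentence-transformers a small domain-specific fine-tune is only a few lines (a sketch using the older model.fit API and made-up Pokémon-flavoured pairs):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")    # ~22M parameters, hardly "large"

    # Hypothetical (query, relevant text) pairs from your own domain.
    pairs = [
        InputExample(texts=["small electric rodent", "Pikachu is an Electric-type Pokemon."]),
        InputExample(texts=["fire starter lizard", "Charmander is a Fire-type Pokemon."]),
    ]
    loader = DataLoader(pairs, shuffle=True, batch_size=2)
    loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)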


To some approximation, if you understood what BERT was at the time it was released, you'd consider it the first modern LLM. GPT-1 was OpenAI's BERT.

Timeline would be viewed as:

2017: Transformers

2018: BERT

2018: GPT-1

2019: GPT-2

2020: GPT-3

2022: GPT-3.5 (ChatGPT)


BERT was not large (it was under a billion parameters), and it wasn't an autoregressive language model like GPT.


Relative to today's models, it is small.

For the time, it was large.

Re: that it's not autoregressive, that's correct.

Things built on each other smoothly: Transformers to BERT to GPT.


I think the author is implying that even if you can't extract real-world value from generative AI, the current AI hype has pushed embeddings to a point where they can provide real-world value to a lot of projects (like the semantic search demonstrated in the article, where no generative AI was used).


Nice.

Can you compare distances just like that on a 2D space post-UMAP?

I was under the impression that UMAP makes metrics meaningless.
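One way to sanity-check it (a sketch with umap-learn; random data stands in for the real embedding matrix):

    # Sketch: compare nearest-neighbour rankings before and after UMAP.
    import numpy as np
    import umap
    from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 384))      # stand-in for the real embeddings

    coords = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

    before = cosine_similarity(X[:1], X).ravel().argsort()[::-1][1:11]
    after = euclidean_distances(coords[:1], coords).ravel().argsort()[1:11]
    print(before, after)                 # the two top-10 lists often disagree

UMAP tries to preserve local neighbourhood structure, not pairwise distances, so distances measured on the 2D plot are at best a rough guide.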


Great post, really enjoyed the narrative flow and the quality of the deep technical details.


Arceus being as close as Rampardos to Mew is kinda funny


> man + women - king = queen

Useless correction, but it's king - man, not man - king.


It's also woman, not women.


Also, you hear that example over and over again because you can't get other ones to work reliably with Word2Vec; you'd have thought you could train a good classifier for color words or nouns or something like that if it really worked, but actually you can't.

Because it could not tell the difference between word senses, I think Word2Vec introduced as many false positives as true positives; BERT was the revolution we needed.

I use similar embedding models for classification and it is great to see improvements in this space.


The other example that worked for me with Word2Vec was Germany + Paris - France = Berlin: https://simonwillison.net/2023/Oct/23/embeddings/#exploring-...
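Both are easy to reproduce with gensim's pretrained GoogleNews vectors (a sketch; the ~1.6 GB download is the slow part):

    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")   # pretrained GoogleNews word2vec

    # king - man + woman ~= queen
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # Germany + Paris - France ~= Berlin
    print(wv.most_similar(positive=["Germany", "Paris"], negative=["France"], topn=3))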


There are a bunch of these things in a word2vec space. Years ago I wrote a post on my group's blog that trained word2vec on a bunch of wikias so we could find out who the Han Solo of Doctor Who is (which, somewhat inexplicably I think, was Rory Williams). You need to carefully implement word2vec, and then the similarity search, but there are plenty of vaguely interesting things in there once you do.


It's a good point about true and false positives though, which makes me wonder if anyone's taken a large database of expected outputs from such "equations" and used it to calculate validation scores for different models in terms of precision and recall.
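gensim actually ships something close to that: Google's questions-words.txt analogy set (roughly 19k of these "equations" across 14 categories) plus a scorer, though it reports top-1 accuracy rather than precision/recall. A sketch:

    import gensim.downloader as api
    from gensim.test.utils import datapath

    wv = api.load("glove-wiki-gigaword-100")    # any KeyedVectors works here

    score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print(f"overall top-1 accuracy: {score:.3f}")
    for section in sections:
        print(section["section"], len(section["correct"]), len(section["incorrect"]))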


really cool read!


Great article - thanks.



