Wikipedia search-by-vibes through millions of pages offline (leebutterman.com)
325 points by gardenfelder on Sept 1, 2023 | 59 comments



This is certainly very interesting.

Unfortunately, I tried describing a few terms across philosophy and psychology, and for all of them the entry I was aiming for ranked only around 20th. (Far more popular but less accurate entries appeared above it -- e.g. no matter what I typed trying to describe a specific modality of psychotherapy, "psychotherapy" was always the #1 result.)

In contrast, I've used ChatGPT to identify the names of certain niche subfields when I couldn't remember what they were called, and it was right every time.

I love the idea of an AI service specifically designed to identify the names of things from descriptions. But I don't think restricting it to Wikipedia (or Wikipedia page titles) is the right approach, and it seems like general-purpose LLMs are doing a great job.

Still, as a proof of concept and as something you can run locally in the browser, this is extremely cool.


Thanks! The goal was to demo the database engine and show off how everything can work airgapped (after the browser downloads everything). I think there are a lot of parameters to tune (Use just the first paragraph of the article, or everything? Search within some short distance of a particular article?) and I haven’t yet.

Wikipedia is a great demo dataset, and I’m definitely up for adding more datasets. Specifically, just like iPhoto lets you search “mountain” and get pictures with mountains, it might be cool to search various datasets with some multimodal models like CLIP


I have found myself describing ideas and goals, and getting back a field (or rather its name) and certain keywords to look for. It seems that LLMs are the best fuzzy search engines, and they work in a rather unique, though possibly complementary, way compared to traditional search engines.


I like the concept, but I'm not having much luck. I entered "weird looking monkey", hoping to get proboscis monkey, golden snub-nosed monkey, etc., but I just ended up with the articles "Pet monkey", "List of individual monkeys", "Ethnoprimatology", and "Monkey".

Whereas when I type the same query into Google, I get exactly what I expected. Which is kind of disappointing -- I was hoping to find out about some weird-looking monkeys that I didn't know about.


Yeah it’s an off-the-shelf sentence-transformer model from over a year ago. The demo was more to show off the embedding database, but the embeddings themselves are slightly useful too.

I don’t keep any analytics on the page about what people find and don’t find, so I haven’t set myself up to improve the search results :/


FWIW sentence-transformers truncates the input to at most 256 tokens by default; you might just be embedding the first paragraph or so.
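
For reference, a minimal sketch of checking and raising that limit with the standard sentence-transformers API (512 is the underlying MiniLM transformer's own cap):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    print(model.max_seq_length)  # 256 by default for this model
    model.max_seq_length = 512   # raise toward the transformer's own limit
    emb = model.encode(["a long wikipedia paragraph ..."])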


I average the embeddings of every 512 bytes of the page text


That might actually be making things worse for longer articles. It would probably be better to index the chunks separately and aggregate back to the article level post-query.
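
A minimal sketch of that chunk-level approach (the model name and corpus here are assumptions; only the aggregate-by-max idea is the point):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # hypothetical corpus: article title -> list of ~512-byte chunks
    articles = {
        "Eiffel Tower": ["The Eiffel Tower is a wrought-iron lattice tower ...", "..."],
    }

    chunk_texts, chunk_owner = [], []
    for title, chunks in articles.items():
        for chunk in chunks:
            chunk_texts.append(chunk)
            chunk_owner.append(title)

    # one embedding per chunk, not one averaged embedding per article
    chunk_emb = model.encode(chunk_texts, normalize_embeddings=True)

    def search(query, k=10):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = chunk_emb @ q  # cosine similarity, since rows are normalized
        best = {}               # aggregate chunk scores back to article level
        for score, title in zip(scores, chunk_owner):
            best[title] = max(best.get(title, -1.0), float(score))
        return sorted(best.items(), key=lambda kv: -kv[1])[:k]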


Ah ok, that makes sense


Wikipedia editors/guidelines are generally not in favor of "opinionated" adjectives, and the use of "weird looking" in your query sounds a lot like something that would be frowned upon in a Wikipedia article. This makes it hard for your search to retrieve good results from this corpus of knowledge.


It's also largely dependent on the embedding model, as others have mentioned. Even if Wikipedia doesn't have any words specifically referring to that monkey as "weird", the model itself could know to correlate this monkey's embeddings with the "weird" concept. The main issue with this particular implementation is the model used (all-MiniLM-L6-v2), which is designed for speed and efficiency over accuracy.


Couldn't someone pretty easily merge it with something more informal?

For instance: take the Reddit data dump (https://academictorrents.com/details/7c0645c94321311bb05bd87...), filter for Wikipedia links, include the context of the thread, and combine that with the contents of the article


Really nice implementation! And it's so cool to be able to do this offline. The embeddings aren't quite there yet.

One trick that might be helpful is to embed only the defining (usually the first) sentence or paragraph of the Wikipedia article, rather than the whole document -- not clear to me which portion you're using now.

My own site, OneLook, has had a similar feature (https://onelook.com/thesaurus/) since '03 that lets you find words and concepts by description. It was a pure reverse-dictionary search back when I started, but over the past two decades I've explored word embeddings, then sentence embeddings, and more recently LLMs. Nowadays it uses GPT to generate some guesses for inputs that it can't answer itself.

LLMs are so much better than earlier methods at this task that it's taken some of the wind out of my sails on improving this aspect of OneLook. I frequently hear from people for whom reverse-definition lookups are the main reason they use ChatGPT!


A little late to the party here, but text embeddings (at least the ones used in this blog post) generally aren't very good at "searching by vibes": they compare mostly by overlapping words, or look for content similar to the search query.

However, there is a recent paper that actually does try to do this: "Retrieving Texts based on Abstract Descriptions" (Ravfogel et al., 2023), https://arxiv.org/abs/2305.12517.

They give many examples of searching by vibes: "an architect designing a building", "a company which is part of another company", "a book that influenced the development of a genre", etc. etc. Their embeddings apparently facilitate this type of search much better. Would be interesting to retry the offline Wikipedia search from the linked post with this new type of embeddings.


The page is currently failing to work for me, because `model_quantized.onnx` isn't loading -- I'm watching it and it has currently managed to get 19.2MB downloaded (at ~50KB/sec) as I type this, so if every visitor is triggering that...

I think we may be doing awful things to Lee Butterman's bandwidth bill.


It’s static files on one t2.nano! Who knows.


Looks like it would help to use a (free) CDN for the static files. You could set up a subdomain that provides caching access to the base site.


It's incredibly impressive for what it does, but the results don't seem very good.

Although I know from experience it's really difficult to assess search result quality by hand, you can be very close to something great and still return far worse matches than this does.


Yes! The quality probably isn’t as good as Similar Website Finder https://explore2.marginalia.nu/ ;) and I bet using a more recent sentence embedding would lead to better results -- I gotta collect more data


Haha, SWF is indeed unreasonably good (although it does something very different, and with a lot more processing power). Although I think that's in large part because I do no dimension reduction and brute-force the cosine similarity calculation with raw 10,000,000-dimension vectors like the programming caveman I am.
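
For the curious, the caveman approach is roughly this -- a toy sketch with random sparse vectors standing in for the real keyword vectors:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import norm as spnorm

    # toy stand-ins for ~10,000,000-dimensional keyword vectors
    docs = sp.random(1000, 10_000_000, density=1e-5, format="csr", random_state=0)
    query = sp.random(1, 10_000_000, density=1e-5, format="csr", random_state=1)

    # no dimension reduction: brute-force cosine over the raw vectors
    dots = (docs @ query.T).toarray().ravel()
    cosine = dots / np.maximum(spnorm(docs, axis=1) * spnorm(query), 1e-12)
    top10 = np.argsort(-cosine)[:10]  # indices of the 10 best matches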


The tech is very impressive but the results are not.

I searched "pointy building in Paris", and got :

Tourism in Paris, Bourse de commerce (Paris), Grands Projets of François Mitterrand, List of tallest buildings and structures in the Paris region, List of tourist attractions in Paris, Palais des congrès de Paris, Landmarks in Paris, Palais de la Bourse, Lyon, Outline of Paris, Architecture of Paris

no mention of the most famous pointy building in Paris...

Maybe sentence embedding of the entire article is not the best thing for this kind of application.


At least 5 of those would have the answer to your question.


But are not _the_ answer to the question.


As mentioned, it embeds 512-byte chunks of each article. Maybe it could return the actual passage it matched instead of just the article title.


If you mean the Eiffel Tower, it's not a building.

I just checked the article, and of the 19 times the word "building" appears, it's mostly a verb, followed by "Chrysler Building"

Unless there's some other famously pointy building I'm not thinking of.


https://www.wikidata.org/wiki/Q243 (the Eiffel Tower, which is so famous it gets an extremely small ID) is an instance of https://www.wikidata.org/wiki/Q1440476 (lattice tower), a subclass of https://www.wikidata.org/wiki/Q12518 (tower) a subclass of https://www.wikidata.org/wiki/Q41176 (building).
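
You can verify that chain programmatically; a small sketch against the public Wikidata SPARQL endpoint (the User-Agent string is just a placeholder):

    import requests

    # does the Eiffel Tower (Q243) transitively fall under building (Q41176)?
    query = "ASK { wd:Q243 wdt:P31/wdt:P279* wd:Q41176 . }"
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "hn-demo/0.1 (placeholder)"},
    )
    print(resp.json()["boolean"])  # True, per the chain above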


I get that, but if this model wasn't trained on Wikidata (I don't think it was), this information cannot be contained in it. It probably should be, though; I love Wikidata.

I was mostly too excited to trot out the thing I remembered about why most towers are not buildings.


Part of search engine magic is mixing in embeddings (or traditional information-retrieval keywords) of the pages that link to a page, weighted by clicks and authority score, for example.

Without this signal, a lot of useful information is ignored and the result doesn't feel as magical.
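
As a rough sketch of what that mixing might look like (weights and signal names invented purely for illustration):

    # hypothetical blend of a semantic score with link-graph signals,
    # all inputs assumed pre-normalized to [0, 1]
    W_SEMANTIC, W_AUTHORITY, W_CLICKS = 0.6, 0.25, 0.15

    def blended_score(cosine_sim: float, authority: float, click_rate: float) -> float:
        return (W_SEMANTIC * cosine_sim
                + W_AUTHORITY * authority
                + W_CLICKS * click_rate)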

Still impressive, fascinating demo


I don't know, I wanted to like this, but I didn't get any relevant matches for any of the searches I tried:

* "The wizard in The Lord of the Rings": No Gandalf or Saruman, only books about LOTR and such.

* "Protagonist of Scorsese's Taxi Driver": No Travis Bickle.

* "A person that plants trees for a living": Somehow a gardener isn't on the list.

* "Curly-haired painter on TV": No Bob Ross anywhere.

* "Unusually shaped modern art museum in Spain": Bilbao does show up as number 4, but none of the others are unusually shaped.

* "Dog shaped like a sausage": Surely a dachshund should be in the top results.


It's worth noting that every result you wanted here does have a Wikipedia article. (If they hadn't, then their absence wouldn't be as strange.)


“Vibes” is a way more relatable term than “sentence embeddings”. I may need to start using that. :)


If it manages to be "relatable," it does so at the expense of a great deal of precision. OP doesn't explain his choice of wording, and it does not match any usage of "vibe" that I'm familiar with. Was "gist" not trendy enough, I wonder?


The correct term would be “semantic search”, but for a guy hand-writing ONNX code I think I can assume they know that.


“Gist” to me implies accuracy, and has a meaning in GitHub, whereas the averaged embedding of 512-character chunks of text is more, uh, impressionistic.


Love this demo but as others noted it's really easy to find queries where it performs poorly (e.g. typos).

Looks like the embedding model used (all-MiniLM-L6-v2) currently ranks 35th on the Hugging Face leaderboard [0]. I'd love to try other models if anyone wants to +1 this demo :). This feels like a nice dataset for building intuition around embeddings used for RAG etc.

[0]: https://huggingface.co/spaces/mteb/leaderboard
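
Swapping models should be close to a one-liner when they ship sentence-transformers weights -- a sketch, with the model choice just an example from that leaderboard:

    from sentence_transformers import SentenceTransformer

    # e.g. thenlper/gte-small, a small model that scores well on MTEB
    model = SentenceTransformer("thenlper/gte-small")
    emb = model.encode(["weird looking monkey"], normalize_embeddings=True)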


A production-ready search engine runs off of a lot more than embeddings. They will have special logic to handle all sorts of special cases, as well as reranking models to show the most relevant results at the top. To me this is more of a demo of client-side vector search, which can be useful for other things.


Lemme try a few other embedding models after the weekend :)


Meta comment about the comments saying “I didn’t find what I was looking for”: the core fact is that it uses Wikipedia as its knowledge base, so if your topic is not represented well enough in Wikipedia, it's not going to return good results. Secondly, comparing something that runs on your mobile device or laptop to Google is like comparing an apple to a container ship full of oranges.


It's really cool, but why doesn't it link to the Wikipedia article?


>This is a browser-based search engine for Wikipedia, where you can search for “the reddish tall trees on the san francisco coast” and find results like “Sequoia sempervirens” (a name of a redwood tree).


I searched "wabisabi tastes like legos" and got back "Kiwiana" so I dug into a bunch of New Zealand jangle pop bands.

I think it's worth it.


Are diacritics supported? Searching "écorché" gave no relevant results. Cf. Google. [0]

[0]https://www.google.com/search?q=%C3%A9corch%C3%A9+site%3Aen....


The quality of the embeddings is a limiting factor for this sort of search - OpenAI text-ada embeddings are great but that removes the local aspect, and the better huggingface models are too big. With the model sizes increasing it’s hard to see what the path will be for local/offline.


There are plenty of great embedding models that are on the order of a few hundred megs (some even outperform ada-002). See the leaderboard here: https://huggingface.co/spaces/mteb/leaderboard. Local/offline is only growing.


Wow gte-small feels like a pretty great balance of size and quality (all-MiniLM-L6-v2 has been my go-to)


Thank you for making this.

What does the number near the words in the results mean? For example, in "book 70k", what does 70k refer to? The number of discussions on Wikipedia about it? The number of edits? Or the number of articles that mention it?


This is great news for those who suffer from memory recall problems. I hope to see more edge devices handle this kind of inference locally.


Heads-up: don't start typing into the form expecting to get a result. It's actually just a screenshot in a blog post.


Good offline search could be a major advance in personal privacy.


Wow, so cool to see this here. Happy to answer any questions


Neat to see your stuff show up here!


Having this integrated into Kiwix would be great!


Low-hanging fruit: make article names clickable!


Even lower-hanging fruit: put a space between the word and the rank so I can word-select the title to copy and paste it.


Low hanging fruit could be a mountain of effort for those who'd otherwise continue to focus on improving the major feature.


> Low hanging fruit could be a mountain of effort for those who'd otherwise continue to focus on improving the major feature.

"https://en.wikipedia.org/wiki/" + encodeURIComponent(title)

Here you have it. The major feature is in the title: you can hardly call it a Wikipedia search engine if you can’t access the articles.


"https://en.wikipedia.org/wiki/" + encodeURIComponent(title.replace(/ /g, '_'))


That’s better to avoid a redirect, but both work.


[flagged]


We're getting complaints that your account is only posting to link to your own things - that's not allowed on HN and readers here consider it spamming. Please see https://news.ycombinator.com/newsguidelines.html:

"Please don't use HN primarily for promotion. It's ok to post your own stuff part of the time, but the primary use of the site should be for curiosity."

If you want to participate in the intended spirit that's great, and occasionally linking to your own work in contexts where it's relevant is ok, but it shouldn't be your primary use of HN. This is a community for human conversation on topics of intellectual interest, and promotion is not that.



