
Maybe I’m missing something but I’ve created vector embeddings for all of English Wikipedia about a dozen times and it costs maybe $10 of compute on Colab, not $5000



This is covering 300+ languages, not just English, and it's specifically using Cohere's Embed v3 embeddings, which are provided as a service and currently priced at US$0.10 per million tokens [1]. I assume if you're running on Colab you're using an open model, and possibly a relatively lightweight one as well?

[1]: https://cohere.com/pricing


This is pretty early in the game to be relying on proprietary embeddings, don't you think? Even if they are 20% better, blink and there will be a new normal.

It's insane to me that someone, this early in the gold rush, would be mining in someone else's mine, so to speak


It’s not just that. Embeddings aren’t magic. If you’re going to be creating embeddings for similarity search, the first thing you need to ask yourself is what makes two vectors similar such that two embeddings should even be close together?

There are a lot of related sources of similarity, but they’re slightly different. And I have no idea what Cohere is doing. Additionally, it’s not clear to me how queries can and should be embedded. Queries are typically much shorter than their associated documents, so they typically need to be trained jointly.

Selling “embeddings as a service” is a bit like selling hashing as a service. There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.


I'm with you on this. The vector embedding craze seems to be confusing mechanism and problem. The problem is semantic similarity search. One mechanism is vector embedding. I think all this comes from taking LLMs as a given, seeing that they work reasonably well with phrase-input semantic retrieval, and then hyper-optimizing vector embedding / search to achieve it.

Are there other semantic search systems? What happened to the entire field of Information Retrieval - is vector search the only method? Are all the stemming, linguistic analysis, all that - all obsoleted by vectors?

Or is it purely because vector search is quick? That's just an engineering problem. I'm not convinced it's the only method here. Happy to be corrected!


The entire field of information retrieval is still here. This was touched on by the O'Reilly article on lessons learned working with LLMs that hit the HN front page yesterday [1], in their section on RAG.

My sense is that you can currently break the whole thing down into two groups: the proverbial grownups in the room are typically building pipelines that are still doing it basically how the top-performing systems did in the '90s, with a souped up keyword and metadata search engine for the initial pass and an embedding model for catching some stuff it misses and/or result ranking. This isn't how most general-purpose search engines work, but it's likely how the ones you don't particularly mind using work. Web search, for example.
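That two-stage shape (cheap keyword/metadata first pass, embedding model for the rerank) can be sketched in a few lines. Everything below is invented for illustration: the toy corpus, the 3-dimensional "embeddings", and the keyword-overlap filter standing in for a real search engine like Lucene:

```python
import numpy as np

# Toy corpus of (text, "embedding") pairs. The 3-d vectors are made up;
# a real system would use a proper embedding model, and a real search
# engine instead of this keyword filter.
docs = [
    ("how to train a dog", np.array([0.9, 0.1, 0.0])),
    ("dog training tips", np.array([0.8, 0.2, 0.1])),
    ("cat grooming guide", np.array([0.1, 0.9, 0.2])),
]

def keyword_pass(query, docs):
    """First pass: cheap keyword-overlap filter over the whole corpus."""
    q_terms = set(query.lower().split())
    return [(text, emb) for text, emb in docs
            if q_terms & set(text.lower().split())]

def rerank(query_emb, candidates):
    """Second pass: order the survivors by cosine similarity."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(candidates, key=lambda d: cos(query_emb, d[1]), reverse=True)

query_emb = np.array([0.85, 0.15, 0.05])  # pretend output of the query encoder
results = rerank(query_emb, keyword_pass("dog training", docs))
print([text for text, _ in results])
```

The point of the structure is that the expensive vector math only runs over the handful of candidates the keyword pass lets through.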

And then there's the proverbial internet comments section, which wants to skip past all the boring, labor-intensive old-school stuff and instead just begin and end with approximate nearest neighbors search using an off-the-shelf embedding model. The primary advantage of this approach - and I should admit here that I've tried it myself - is that you can bodge it together over a weekend and have the blog post up by Monday.

I guess what I'm getting at is, the people producing content on the Internet and the people producing effective software aren't necessarily the same people. I mean, heck, look at me, I'm only here to type this comment because I'm slacking off at work today.

1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...


Your comment makes a lot of sense to me.

What I wonder, though, is: we're a year and a half into the LLM craze and we still don't see a really good information-processing system built on them. Yes, there are chatbots, some of which let you throw in images and PDFs.

But what we need is more like a ground-up rethink of these UIs. We need to invent the "desktop" of LLMs.

But the keys here, I think, are that

a) the LLMs are only part of the solution. A chat interface is immature and not enough.

b) external information is brought in by the user, and augmented by a universe of knowledge given by the provider

c) being overly general is probably a trap. Yes, LLMs can talk about everything - but why not solve a concrete vertical?

Semantic search helps with a part of this, but is just one component.


Also, frankly, I don't think a chat interface is good UX. People are having fun with it right now because it's novel. But human-human interaction doesn't use natural language because it's somehow ideal; we rely on it due to hardware limitations. We don't have the same set of limitations in human-computer interaction. And we also have a lot of history (as in, literally all of history) demonstrating that, even when talking to each other, humans quickly start straying away from pure natural language interaction whenever their communication is modulated by a technology that allows for additional options.

You can even see some of this play out a bit over the course of the web's nearly 30-year history. 20 years ago, informational websites tended to be brief, highly structured, and minimally chatty. Nowadays, people produce walls of text that you have to dig through to find the actual content. Why the change? Search engine optimization. Which I'd argue is an example of essentially the same folks who give us AI basically dragging us back to a world where natural language dominates. Not because it's actually better for anyone, but because it's what they can more easily build a one-size-fits-all algorithm around.


Part of the reason why LLM summaries are so attractive IS a UI problem. The economics of the web have led every publisher to stuff their websites with ads. No one wants that. It's much nicer to see a clean paragraph of text.

But we clearly have an ouroboros situation. If publishers lose views, they lose money and the ability to craft good information. Less new info to incorporate into LLMs.

LLM training over the internet corpus has really been a massive heist: pulling the wool over publishers' eyes, undercutting their business, hoarding the information.

But it's really unavoidable at this point. Everything has been democratized: compute on cloud platforms, data via Common Crawl, open-source algorithms and toolkits. No one can put a stop to this, and there are powerful economic incentives to actually get some benefit out of the hundreds of billions that have already been poured in.


> Are there other semantic search systems?

Not a semantic search but stemming + BM25 often works surprisingly well and is a fast and cheap baseline.
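For reference, the scoring behind BM25 is simple enough to sketch in pure Python. No stemming here (which is why "dogs" won't match "dog" below); a real setup would stem terms first and use a library like rank_bm25 or Lucene:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against query tokens with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            # Term weight falls off with document frequency (idf) and
            # saturates with in-document frequency (the k1/b part).
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the quick brown fox".split(),
    "lazy dogs sleep all day".split(),
    "the dog chased the fox".split(),
]
scores = bm25_scores("dog fox".split(), docs)
```

The third doc wins because it matches both terms; the second scores zero because, without stemming, "dogs" is a different token than "dog".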


You probably won't find out exactly what they're doing any more than anyone's going to find out a whole lot of details on what OpenAI is doing with GPT. But, as the popularity of GPT demonstrates, it seems that many business customers are now comfortable embracing closed models.

There is more information here, though: https://cohere.com/blog/introducing-embed-v3


> There are a lot of different hash functions. Cryptographic hashes, locality sensitive hashes, hashes for checksum, etc.

And there are standard hash functions in the standard library that cover 98% of use cases. I think the same goes for embeddings: you can train a foundational multitask model, and its embeddings will work for a variety of tasks too.


> If you’re going to be creating embeddings for similarity search, the first thing you need to ask yourself is what makes two vectors similar such that two embeddings should even be close together?

I have no association with Cohere, but their docs clearly say that their embeddings were trained so that two similar vectors have similar "semantic meaning". Which is still pretty vague, but it's at least clear what their goals were.

> Selling “embeddings as a service” is a bit like selling hashing as a service.

Coincidentally, Cohere also aggressively advertises that they want you to fine-tune and co-develop custom models (with their proprietary services).


But this is the GP's point — that doesn’t mean they’re optimized for retrieval.


I have no idea. But that wasn't the question I was answering. It was, "how does the article's author estimate that would cost $5000?" And I think that's how. Or at least, that gets to a number that's in the same ballpark as what the author was suggesting.

That said, first guess, if you do want to evaluate Cohere embeddings for a commercial application, using this dataset could be a decent basis for a lower-cost spike.


Yes, that is how I came up with that number.


Ah, didn’t realize it was every language. Yes, I’m using a lightweight open model — but my use case doesn’t require anything heavyweight either. Wikipedia articles are very feature-dense and easily differentiated from one another. It doesn’t take a massive feature vector to create meaningful embeddings.


It's 35M 1024-dimensional vectors, plus the text.


Still, $5000 is kind of insane. That makes no sense to me.


I also don’t quite understand the value of embedding all languages into the same database. If I search for “dog” do I really need to see the same article 300 times?

As a first step they are using PQ anyway. It seems natural to just assume all English docs have the same centroid and search that subspace with hnswlib.
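For what it's worth, the core of PQ (product quantization) is easy to sketch: split each vector into subvectors, quantize each subspace against its own small codebook, and store only the centroid ids. In a real index the codebooks come from k-means over the data; the random ones below exist purely to show the encode/decode mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub, n_centroids = 1024, 8, 256   # 8 subspaces, one byte of code each
sub_dim = dim // n_sub                   # 128 dims per subspace

# Random stand-in codebooks; real ones would be k-means-trained on the corpus.
codebooks = rng.normal(size=(n_sub, n_centroids, sub_dim))

def pq_encode(vec):
    """Compress one vector to n_sub uint8 codes (1024 floats -> 8 bytes)."""
    parts = vec.reshape(n_sub, sub_dim)
    # Distance from each subvector to every centroid in its subspace
    dists = np.linalg.norm(codebooks - parts[:, None, :], axis=2)
    return dists.argmin(axis=1).astype(np.uint8)

def pq_decode(codes):
    """Reconstruct a (lossy) approximation of the original vector."""
    return np.concatenate([codebooks[i, c] for i, c in enumerate(codes)])

v = rng.normal(size=dim)
codes = pq_encode(v)
approx = pq_decode(codes)
```

An ANN library like hnswlib or FAISS would then search over these compact codes instead of the raw 1024-float vectors.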


It's split by language. TFA builds an index on the English language subset.


Ah, missed that.


hey, 3 cents cheaper than text-embedding-3-large (without batching)!

Are there benchmarks available that compare it with the OpenAI model?


Did you use the same method, i.e. split each article into chunks and vectorize each chunk?


That's the only way to do it; you can't index the whole thing. The challenge is chunking: there are several algorithms for chunking content for vectorization, each with its own pros and cons.


You can do much bigger chunks with models that use rotary position embeddings (RoPE), such as nomic-embed-text-v1.5, which has an 8192-token context length: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5

In theory this would be an efficiency boost but the performance math can be tricky.


As far as I understand it, long context degrades LLM performance, so even when an LLM "supports" a large context length, in practice it basically clips a chunk from the top and bottom and skips over the middle bits.


Why would you want chunks that big for vector search? Wouldn't there be too much information in each chunk, making it harder to match a query to a concept within the chunk?


The problem is that semantic meaning often depends on context multiple paragraphs or sections away.

Bigger chunks are a coarse way to tackle that.


Yes


Also, if you’re spending $5000 to compute embeddings, why are you indexing them on a laptop?


He's not, though, because Cohere put the already-embedded dataset on Hugging Face: https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-em...


Got any details?


Nothing too crazy: just downloading a dump, splitting it into manageable batch sizes, and using a lightweight embedding model to vectorize each article. Using the best GPU available on Colab it takes maybe 8 hours, if I remember correctly? The vectors can be saved as NPY files and loaded into something like FAISS for fast querying.
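The shape of that pipeline (batch, embed, save NPY, load, search) looks roughly like this. The `embed` function is a toy word-hashing stand-in for a real model, and the final brute-force dot-product search stands in for FAISS (IndexFlatIP does the same thing, just much faster at scale):

```python
import os
import tempfile
import zlib
import numpy as np

# Toy stand-in for a real embedding model: hashes words into a fixed-size,
# L2-normalized bag-of-words vector. Illustration only.
def embed(texts, dim=64):
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            out[i, zlib.crc32(w.encode()) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-9)

articles = ["dog training basics", "fox hunting history", "cat care guide"]

# Embed in small batches and persist as an NPY file
vecs = np.concatenate([embed(articles[i:i + 2])
                       for i in range(0, len(articles), 2)])
path = os.path.join(tempfile.gettempdir(), "wiki_vecs.npy")
np.save(path, vecs)

# Query time: load the vectors and search by (cosine) dot product
index = np.load(path)
query = embed(["dog training tips"])[0]
best = int((index @ query).argmax())
print(articles[best])
```

Swapping the brute-force line for a FAISS index is what makes this viable at millions of vectors.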


This probably deserves its own article and might be of interest to the HN community.


I will probably make a post when I launch my app! For now I’m trying to figure out how to host the whole system cheaply, because I don’t anticipate generating much revenue.


I'm interested in hearing about what you will be hosting.

Would Digital Ocean or Hetzner meet your needs?


I was going to use EC2 and S3; should I look at Digital Ocean or Hetzner instead?


What is the end task (e.g. RAG, or just vector search for question answering), and are you satisfied with the results in terms of quality?


The end result is a recommendation algorithm, so basically just vector similarity search (with a bunch of other logic too ofc). The quality is great, and if anything a little bit of underfitting is desirable to avoid the “we see you bought a toilet seat, here’s 50 other toilet seats you might like” effect.


Did you chunk the articles? If so, in what way?


Yes. I split the text into sentences and append sentences to a chunk until the max context window is reached. The context window size is dynamic for each article so that each chunk is roughly the same size. Then I just mean-pool the chunks for each article.
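The greedy sentence-packing plus mean pooling described above can be sketched like this. Two simplifications: a fixed word budget rather than the per-article dynamic one described, and a hypothetical `embed_chunk` hash function standing in for the real embedding model:

```python
import re
import numpy as np

def chunk_sentences(text, max_words=50):
    """Greedily pack whole sentences into chunks of at most max_words words
    (a single over-long sentence still becomes its own chunk)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for s in sentences:
        if current and len(" ".join(current + [s]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(s)
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed_chunk(chunk, dim=16):
    """Hypothetical stand-in for a real embedding model."""
    vec = np.zeros(dim)
    for w in chunk.lower().split():
        vec[sum(map(ord, w)) % dim] += 1.0
    return vec / max(float(np.linalg.norm(vec)), 1e-9)

def article_embedding(text, max_words=50):
    """Mean-pool the chunk embeddings into a single article vector."""
    chunks = chunk_sentences(text, max_words)
    return np.mean([embed_chunk(c) for c in chunks], axis=0)

text = "First sentence here. " * 10
emb = article_embedding(text)
```

Mean pooling keeps one vector per article, at the cost of blurring together chunks about different topics.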


Thanks for the answer.

Wikipedia has a lot of tables so I was wondering if content-aware sentence chunking would be good enough for Wikipedia.

https://www.pinecone.io/learn/chunking-strategies/


mwparserfromhell can parse out the text content without including tables.


How big is the resulting vector data?


Like 8 GB, roughly.


Do you have a link to the notebook?


No, haha, just a rat's nest of a bunch of notebooks.


How?



