Show HN: A discovery-focused search engine for Hacker News (trieve.ai)
193 points by skeptrune 5 months ago | 33 comments
We (Nick, Dens, Denzell, Fede, Drew, Aaryan, and Daniel) have been building HN Discovery, a discovery-focused search engine for Hacker News, in our spare time for the past 6 months and are excited to show it! It adds the following features relative to the existing keyword search interface and preserves the existing ones:

- no-JS version (hnnojs.trieve.ai)

- site:{required_site} and site:{negated_site} filters

- public analytics

- LLM generated query suggestions based on random stories

- recommendations

- dense vector semantic search

- SPLADE fulltext search

- RAG AI chat

- order by descendant count

client code (FOSS self-hostable) - https://github.com/devflowinc/trieve-hn-discovery

engine code (BSL source-available) - https://github.com/devflowinc/trieve

There is an extended about page with detailed information on features, how much it costs to run, etc. here - https://hn.trieve.ai/about.




Congrats on building and shipping! I love how "at home" the styling feels.

I searched for `Excel` since that's a topic I care about and tend to follow frequently.

The first ("most relevant") link is a post with 1 point and 0 comments from 2020. The second link has 1 comment and 2 points, from 2017.

The top two links on Algolia's search have about 1000 points each and are way more topical.

I tried to hit "Back" to make this comment, and saw the site broke my browser's navigation. I was forced to right click on "Back" (or spam the back button) to get back here... so not a great experience, overall


- The version without search-as-you-type and JS at https://hnnojs.trieve.ai/ has more normal back-button behavior.

- Algolia ranks by points by default while we rank by relevance score, which is the difference. You can order our results by points with the "order by" select component that currently says "relevance".

We went back and forth on making points sorting the default and ended up deciding against it, but maybe we should have. Our thinking was that since the site is focused on "discovery" it was worth prioritizing relevance, but I can see how it can feel like the result quality isn't as great. HN is really good at highlighting interesting links.

Best fix would have been LTR, but we made incorrect decisions early on which made the rewrite a bit too hard - https://trieve.ai/launching-trieve-hn-discovery/#relevance-q...


Single-word searches don't do well with similarity search out of the box, although some heuristics could probably help.
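One such heuristic (a minimal sketch, not how Trieve actually does it; the function name, tokenization, and weights are all made up for illustration) is to blend the dense-vector similarity with an exact-token boost, so literal hits on the query word are pulled back to the top:

```python
def hybrid_score(query: str, doc_text: str, vector_sim: float, alpha: float = 0.7) -> float:
    """Blend dense-vector similarity with an exact-token boost.

    For single-word queries, pure similarity search often surfaces
    documents that are merely *about* related topics; boosting exact
    token matches favors documents that literally contain the word.
    """
    tokens = doc_text.lower().split()
    exact_boost = 1.0 if query.lower() in tokens else 0.0
    return alpha * vector_sim + (1.0 - alpha) * exact_boost

# A literal "freebsd" hit with modest similarity outranks a
# related-but-non-matching doc with higher raw similarity:
a = hybrid_score("FreeBSD", "freebsd jails explained", vector_sim=0.60)
b = hybrid_score("FreeBSD", "linux kernel namespaces", vector_sim=0.75)
```

With alpha = 0.7 the literal match scores 0.72 versus 0.525 for the non-match, which is the behavior you'd want for a query like "FreeBSD".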


Very true


I like the layout. I like the "Try with Algolia" button. I like the search recommendations.

But when I search for "FreeBSD" I get:

* FreeBSD is an amazing operating system

* FreeBSD Is an Operating System

* 9x FreeBSD – a lesson in poor defaults

In the top 15 results, 10 are duplicates. And none of the articles are interesting.

If I sort by points instead of relevance, nothing has to do with FreeBSD.


Sorting by points is now fixed. When a query term is not an English word, we automatically quote it, so a query for 'FreeBSD' is transformed into '"FreeBSD"'. This means that, when sorting by points, you will no longer see results containing only the "Free" token at the top!
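A minimal sketch of that auto-quoting idea (the word set, function name, and details are illustrative assumptions, not Trieve's actual implementation, which would use a full English vocabulary or language detection):

```python
# Tiny stand-in vocabulary; a real implementation would use a full
# English word list or a language-detection library.
ENGLISH_WORDS = {"free", "operating", "system", "search", "the", "is"}

def auto_quote(query: str) -> str:
    """Wrap out-of-vocabulary tokens in quotes to force exact matching,
    so 'FreeBSD' doesn't get tokenized down to a match on 'free'."""
    out = []
    for token in query.split():
        if token.lower() in ENGLISH_WORDS or token.startswith('"'):
            out.append(token)  # known word or already quoted: leave as-is
        else:
            out.append(f'"{token}"')  # unknown token: force exact match
    return " ".join(out)

print(auto_quote("FreeBSD"))         # → "FreeBSD"
print(auto_quote("free operating"))  # → free operating
```

The nice property is that ordinary English queries pass through untouched, while product names and other out-of-vocabulary tokens get exact-match semantics automatically.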

We're still not deduping, because I think the duplicates are semi-useful for "past"-style browsing, but I'm happy that the sort functionality is no longer broken.


Yeah. We didn't dedupe when we ingested, but probably should have. The workaround for now is putting "FreeBSD" in quotes when sorting by points, or raising the score threshold in options; we have it set to a very low value by default.


This is impressive! I've frequently encountered challenges with Algolia search not locating specific items, but this appears to offer a much more detailed search capability.

I've bookmarked this site and hope it remains available when I need it, unlike many great Show HN posts that vanish after six months or so.


Glad you found it useful! Fulltext search will almost certainly be up in perpetuity; however, we may drop the semantic index if it doesn't get much usage, as that's significantly more expensive to host.


Oh thanks! I had one heck of a difficult time trying to find the author and title of 'The Glass Bead Game' by Hermann Hesse. It's pretty hard to find with simple keyword-based search.

Though short comments seemed to score a bit too highly IMHO. It took a while to find a query that found the long rambly comment I needed.


We spent quite a bit of time trying to make length normalization better, but there's still a lot of room for improvement. The default behavior was super biased towards longer rambly comments and we may have over-corrected. I appreciate the note.


> $6835.39/month

This seems way higher than I expected. Cloud pricing is out of control when Postgres alone is already >$500/month for a small instance that could be run for a fraction of that outside a cloud provider.


Pretty sure we will have to colocate soon in general. Cloud provider costs are near unsustainable, for our business at least.


I’m willing to bet that I can reduce your costs by at least 10x. I’d go so far as to say this thing should be able to handle HN front page traffic at < $300 / month, including all real-time vector search.

That is, if this 6k number is actually true. Part of me (forgive me) is in fact wondering if maybe this is an advertisement for your SaaS and you’re inflating this number to make people think there’s no way they can build a thing like that themselves. But, giving you the benefit of doubt, if you are truly paying this, you’re overspending by more than an order of magnitude. Most likely too many middlemen.

Email is in my profile if you want to talk about it.


It is 100% not made up to make our SaaS more attractive. Our shared SaaS only goes up to 1M vectors, so it's not like it's cheaper anyway. We would currently charge raw cost + ~20% to host something at HN scale. Almost none of the cost is due to serving traffic; it's all rented high-memory compute instances and GPUs. We could serve ~1k QPS on the current infra.

Our terraform and helm are public in the repo - https://github.com/devflowinc/trieve/tree/main/terraform/gcl...


Good job, we built something similar. It may not be ready for a full Show HN, but if you want to try it: https://hn.garglet.com

Has some additional features:

1. Find similar users

2. Negative similarity search

3. Browse user comments from oldest first

4. Flattened comments in reverse chronological order on stories

Edit: show hn thread:

https://news.ycombinator.com/item?id=41404856


OK, so I have a very specific product area that I'm both interested in from a discovery perspective and know (modestly) quite a lot about. I just ran a query on it, and I must say, this search engine returned a really interesting and thought-provoking set of results. Good job.


Awesome to hear!


The quality of the search is really good! Really useful to see how opinions change over time on HN (try typing in 'rust is').


This is very cool! Since HN's own search doesn't work without JS enabled, this is a massive help.


Hell yeah. We were super passionate about making sure it would work without JS as well as with.


Trieve is a very impressive achievement; you've managed to slam together every AI buzzword into a semi-usable product! It's written in Rust, which makes the entire proposition even more perfect for HN :). Love it.

Some questions (and thanks for the detailed "about" page, it answered several of my initial questions!) ~

Will you re-index and keep updating the system to improve the quality of results, or what is the plan? It'd be awesome to have something more nuanced than Algolia that, like Algolia, stays updated in near real time.

How easy or challenging is it to bootstrap / re-index? Is it possible to ingest new data with partial updates to the existing indices, or is a full indexing from zero always required?

Are GPUs strictly necessary? Is it possible to use only CPUs if indexing speed isn't a great concern?

Does it really require a terabyte of working memory to index and serve all of the data for HN? (4x 256GB / 128 CPUs is mentioned in your ops details) This is a lot of resources! Like A LOT!

Have you considered indexing other high-quality data sources? For example:

* Lobste.rs (I think you can email them requesting a DB dump, they want disclosures about the intended purpose)

* Slashdot (debatable quality, but goes back 27 years which could be interesting)

* Review sites: Chipsandcheese, TomsHardware, Anandtech, HardOCP

* Lwn, Phoronix

* I'm surely missing other good ones, the discussion in The End of Anandtech article from today mentions a bunch of interesting sources: https://news.ycombinator.com/item?id=41399872

I wonder if getting some of these data sources through CommonCrawl or archive.org would reduce the crawl+parse annoyance?

At some point I want to put together an HN-Awesome-Search page which covers all the custom search indexes HN folks have made over the years.

Thank you!


> managed to slam together every AI buzzword into a semi-usable product!

THANK YOU! The initial motivation for us was, to a large extent, making it easier to throw data into an API and test whether all the latest AI buzzword tech is useful for your problem. Hearing that we've done that in a semi-usable way is energizing!

> re-index and keep updating

Yes. It updates in real time, and we are collecting CTR data and analytics to continually fine-tune. There's also a "Rate Results" button, and we're looking at that feedback to improve.

> partial updates

Partial updates work really well and we do them constantly with no impact on performance or uptime. Full indexing from zero is only required if we want to make an index-wide change to improve relevance.

> GPUs strictly necessary? Is it possible to use only CPUs if indexing speed isn't a great concern?

They are not strictly required, but inference on CPU is going to take 300+ ms at minimum, which makes search feel very laggy. Indexing on CPU is actually less painful than search, because it's a lot more OK for indexing to be slow.

> Does it really require a terabyte of working memory to index and serve all of the data for HN? (4x 256GB / 128 CPUs is mentioned in your ops details)

We didn't build our own database engine for this and are running Qdrant. We certainly tried to get memory usage down, and this is the best we got. Qdrant's capacity calculator says this should take only 128 GB with dense vectors and binary quantization, but we just didn't find that to be true. However, it may be the case that we can improve this further with time.
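Rough back-of-envelope numbers for that gap, assuming 768-dimensional float32 embeddings (the dimension is an assumption; Trieve's actual model may differ, and this ignores the HNSW graph, payloads, and fulltext/SPLADE indexes):

```python
# Back-of-envelope memory estimate for ~38M dense vectors.
# 768 dims is an assumed embedding size; adjust for the actual model.
NUM_VECTORS = 38_118_782
DIMS = 768

full_precision_gib = NUM_VECTORS * DIMS * 4 / 2**30    # float32: 4 bytes/dim
binary_quantized_gib = NUM_VECTORS * DIMS / 8 / 2**30  # binary: 1 bit/dim

print(f"float32: {full_precision_gib:.1f} GiB")  # ~109 GiB
print(f"binary:  {binary_quantized_gib:.1f} GiB")
```

Under these assumptions the raw float32 vectors alone land near the calculator's 128 GB figure; presumably the rest of the 1 TB footprint goes to graph structures, full-precision rescoring copies, and the other indexes.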

> indexing other high-quality data sources

Maybe? I'm somewhat active on lobste.rs, but don't participate much in the other communities, which reduces my interest. HN is really cool since I understand it well enough to get a feel for how good the search is, which helps us tune things for the overall product.

> HN-Awesome-Search page

You should! Check out the last section of our "History of HN Search" blog post for a starting point --> https://trieve.ai/history-of-hnsearch/

We tried to be very thorough.


This is insanely fast! How many records is it processing?


38,118,782 as of right now. It was a massive undertaking on the ops side of things. IMO, the most impressive bit is that we eventually made our ingest fast enough to load the entire dataset in under 24 hours; back in February, it took two weeks.

Speeding up ingest made experimenting with different indexing strategies a lot more viable.


This is impressively fast. Well done!


Looks overwhelming on mobile


We're going to hide a few things on small screen sizes.


Thanks for the nojs version!


Of course!!


That’s useful, thanks!


This is nice - I for one appreciate the results not being ranked by points, since some interesting stories don't get any traction.


It's a tough call. I personally agree that default sorting on relevance score is better for discovery. It's easy enough to switch the ordering to points if desired.



