Show HN: Podcastsaver.com – a search engine testbench dressed as a podcast site
16 points by hardwaresofton on Oct 24, 2022 | 11 comments
Hey HN,

I submitted PodcastSaver (https://podcastsaver.com) before, but the reason it's interesting now is that I've started converting it into a live search engine test-bench.

I've discussed it a bit here[0], and this idea has been kicking around in my head for a while, so I got a chance to do some related writing about it with Supabase[1].

The basic idea is to use a modest piece of the Podcast Index[2] as a place to test out different new-age search engines against each other.

So far there are two engines running:

- Postgres FTS + pg_trgm (tuned -- the indices are there, and I did some EXPLAINing earlier today to tighten things up, but it's still all built-in tech; there's a sketch of this setup just after this list)

- Meilisearch (untuned -- just stand it up, give it resources and put in documents)
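
For the curious, the Postgres side boils down to roughly the pattern below. This is a minimal sketch, not the actual PodcastSaver schema -- the podcasts table, the column names, and the node-postgres usage are stand-ins -- combining a tsvector GIN index for FTS with a pg_trgm index for fuzzy title matching:

  // Assumed one-time setup (hypothetical table/column names):
  //   CREATE EXTENSION IF NOT EXISTS pg_trgm;
  //   CREATE INDEX podcasts_fts_idx ON podcasts
  //     USING GIN (to_tsvector('english', title || ' ' || description));
  //   CREATE INDEX podcasts_title_trgm_idx ON podcasts
  //     USING GIN (title gin_trgm_ops);
  import { Pool } from "pg";

  const pool = new Pool({ connectionString: process.env.DATABASE_URL });

  // Full-text matches ranked first, trigram similarity as a typo-tolerant fallback.
  export async function searchPodcasts(q: string) {
    const { rows } = await pool.query(
      `SELECT id, title,
              ts_rank(to_tsvector('english', title || ' ' || description),
                      websearch_to_tsquery('english', $1)) AS rank,
              similarity(title, $1) AS sim
         FROM podcasts
        WHERE to_tsvector('english', title || ' ' || description)
              @@ websearch_to_tsquery('english', $1)
           OR title % $1
        ORDER BY rank DESC, sim DESC
        LIMIT 20`,
      [q]
    );
    return rows;
  }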

To that end, I've added a "nerds" page you should peruse: https://podcastsaver.com/nerds

On that page you can:

- choose your search engine

- choose whether to force-disable the cache (obviously you'd want it disabled for the results to mean anything, but for regular visitors just browsing, the cache stays on!)

As far as actually making the podcast search really good goes, there is a ton of curation left to do, so it's still a subpar consumer product, but it's interesting from this angle at least! I'm going to add more search engines later, but who knows when (this project was supposed to be short!).

I can't add every engine on the huge list of new-age search engines[3], but I can say that I will get to highlighting all of them in Awesome F/OSS[4]... Eventually.

[0]: https://news.ycombinator.com/item?id=33316029

[1]: https://supabase.com/blog/postgres-full-text-search-vs-the-rest

[2]: https://podcastindex.org/

[3]: https://news.ycombinator.com/item?id=33317232

[4]: https://awsmfoss.com




The text is unreadable on the nerds page in Dark Mode.

https://podcastsaver.com/nerds


A cardinal sin!

Apologies, I finished that page JUST before posting -- will fix.


UPDATE: this is fixed now!


Looks like I didn't give Meilisearch enough latitude -- it's still ingesting and I didn't give it enough disk:

[2022-10-24T20:04:38Z ERROR meilisearch_lib::tasks::update_loop] an error occurred while processing an update batch: Internal error: Quota exceeded (os error 122)

Looks like I'll also have to re-feed it, but I reduced the batch size, so hopefully it's a little happier with that.


Hello, I'm the Meilisearch CEO. Your issue could be that you sent your data without configuring your index first, which is what I read from your comment. Just change your settings so that URLs are not indexed.

Check out the docs: https://docs.meilisearch.com/reference/api/settings.html#upd... with the payload `["title", "descriptionHTML"]`.

It will change everything!
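
Roughly, with the JS client it looks like this (host, key, and index name here are examples, not your actual setup):

  import { MeiliSearch } from "meilisearch";

  const client = new MeiliSearch({
    host: "http://localhost:7700",       // example host
    apiKey: process.env.MEILI_MASTER_KEY,
  });

  // Only title and descriptionHTML get tokenized and indexed for search;
  // other fields are still stored and returned, just not searchable.
  const task = await client
    .index("podcasts")                    // example index name
    .updateSearchableAttributes(["title", "descriptionHTML"]);

  console.log("settings update enqueued:", task);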


Hey, thanks for letting me know! I did add a few fields that weren't directly searched on, because I wanted to be a bit more fair across the other search engines (Postgres is holding the whole document plus the indexes).

I’m going to change the configuration and see how that goes.

One thing I'd love help with (that would make an awesome recipe section for your docs site) is best practices around bulk insertion! I couldn't tell if there was an actual benefit to using addDocuments() vs addDocumentsInBatches().


If you remove the URLs from indexing, it'll generally save a ton of space and will be much, much faster to index. We are thinking about not indexing URLs by default; you can help us by explaining your use case here -> https://github.com/meilisearch/product/discussions/553

Just a detail: if you run a `du -sh` on your machine, the size on disk will stay unchanged because we do soft deletion ;). Don't worry, the data will be physically deleted after a while, so the space will come back if you need it in the future.

If you kept the default Meilisearch configuration, the maximum size of the HTTP payload is 100MB (for security). You can change it here -> https://docs.meilisearch.com/learn/configuration/instance_op...

addDocumentsInBatches() is just a helper that splits your big JSON array into multiple parts before sending it; I'm not absolutely sure you'll need it. (Code -> https://github.com/meilisearch/meilisearch-js/blob/807a6d827...)
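
To make the difference concrete, here's a rough sketch (the docs array and index name are placeholders):

  import { MeiliSearch } from "meilisearch";

  const client = new MeiliSearch({ host: "http://localhost:7700" });
  const index = client.index("podcasts"); // placeholder index name

  // Stand-in for the real ~2MM podcast documents.
  const docs = [{ id: 1, title: "Example", descriptionHTML: "<p>An example</p>" }];

  // Option 1: one request with the whole array -- fine as long as the payload
  // stays under the server's HTTP payload size limit (100MB by default).
  const task = await index.addDocuments(docs);

  // Option 2: the helper chunks the array client-side and sends one request
  // per `batchSize` documents; same end result, just smaller payloads.
  const tasks = await index.addDocumentsInBatches(docs, 10_000);

  console.log(task, tasks.length);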


Thanks! I removed the URLs and now the searchable attributes are only title, description and some author fields!

> Just a detail: if you run a `du -sh` on your machine, the size on disk will stay unchanged because we do soft deletion ;). Don't worry, the data will be physically deleted after a while, so the space will come back if you need it in the future.

Ah, I was just wildly undershooting the size I gave the PVC! I just gave it much more and it's fine -- right now it's resting around 19Gi of usage, which is actually a bit of a problem considering the data set was only about 4GB originally. That said, disk is really not an issue, so I'll just throw more at it, maybe leave it at 32GB and call it a day (it's at around 1.6MM documents out of ~2MM), so it shouldn't be too much more.

> If you kept the default Meilisearch configuration, the maximum size of the HTTP payload is 100MB (for security). You can change it here -> https://docs.meilisearch.com/learn/configuration/instance_op...

Thanks for this, I'll keep it in mind -- so I could actually pass off HUGE chunks to Meilisearch.

It seems like the larger the chunk, the more efficient? There didn't seem to be much of a change in how much time it took to work through a chunk of documents; it was more that having lots of smaller chunks would go slower overall. I started off with 10k documents in a batch, then went to 1k, then back to 5k -- maybe I should go to 100k docs in a batch and see how it performs.
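
Something like this rough sketch is what I have in mind for comparing batch sizes (host and index name are placeholders, and waitForTasks/taskUid are whatever the meilisearch-js version you're on calls them):

  import { MeiliSearch } from "meilisearch";

  const client = new MeiliSearch({ host: "http://localhost:7700" });
  const index = client.index("podcasts"); // placeholder index name

  // Time how long it takes to fully *index* (not just enqueue) the same docs
  // at a given batch size. Indexing is async, so we wait on the tasks.
  async function timeIngestion(docs: Record<string, unknown>[], batchSize: number) {
    const start = Date.now();
    const tasks = await index.addDocumentsInBatches(docs, batchSize);
    await client.waitForTasks(
      tasks.map((t) => t.taskUid),
      { timeOutMs: 60 * 60 * 1000 }
    );
    return Date.now() - start;
  }

  // NOTE: re-adding the same docs makes later runs updates rather than fresh
  // inserts, so a real comparison should use distinct slices of the corpus.
  const docs = [{ id: 1, title: "Example" }]; // stand-in for the real corpus
  for (const size of [1_000, 5_000, 10_000, 100_000]) {
    console.log(size, "->", await timeIngestion(docs, size), "ms");
  }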

There's a blog post waiting to be written in here...

> addDocumentsInBatches() is just a helper that splits your big JSON array into multiple parts before sending it; I'm not absolutely sure you'll need it. (Code -> https://github.com/meilisearch/meilisearch-js/blob/807a6d827...)

Thanks! Was this something someone requested? Is there a tangible benefit (were there customers who didn't want to split up the payloads themselves)? It seems like unnecessary cruft in the API otherwise.


Very interesting project. Do you plan on including other search engines such as Elasticsearch, Solr, Pinecone, or Vespa?


None of those in particular, but see this comment for the ones I do want to add!

https://news.ycombinator.com/item?id=33316029


Clickable link: https://podcastsaver.com



