
Great suggestions, looking into this right now. First time building something like this so definitely new to some of these tools.

For scraping: I found that every Shopify store has a public JSON file available at the same route, [Base URL]/products.json. For example, the store for Wildfox has its JSON file available here: https://www.wildfox.com/products.json.
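As a minimal sketch (assuming Node 18+ for the global fetch, and that the endpoint still accepts the limit and page query params), pulling that file looks roughly like this:

    // Fetch one page of a store's public product listing.
    // Assumes the store exposes /products.json with `limit` and `page` params.
    async function fetchProducts(baseUrl, page = 1) {
      const res = await fetch(`${baseUrl}/products.json?limit=250&page=${page}`);
      if (!res.ok) throw new Error(`HTTP ${res.status} for ${baseUrl}`);
      const { products } = await res.json();
      return products; // array of product objects (title, handle, variants, ...)
    }

    fetchProducts("https://www.wildfox.com")
      .then((products) => console.log(products.length, "products"))
      .catch(console.error);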

I built a crawler in plain JavaScript that runs through a list of stores I bought on a site called BuiltWith, fetches each store's JSON file with the product listing data, and scrapes the exact fields we want for Agora. That gets stored in Mongo, and search currently uses Mongo Atlas Search (I saw they released Vector Search but haven't looked at it yet). It has been a process of trial and error to pick the fields the front-end experience requires without drastically increasing the size of the data set. After initially using React, I switched to Next.js to make it easier to structure the URLs of each product listing page.
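Roughly, the crawl-and-store loop looks like the sketch below. The collection and field choices are illustrative, not the exact code (assumes Node 18+ and the official mongodb driver):

    // Read a list of store URLs, pull products.json, keep only the fields
    // the front end needs, and upsert into MongoDB.
    const { MongoClient } = require("mongodb");

    async function crawl(storeUrls) {
      const client = new MongoClient(process.env.MONGO_URI);
      await client.connect();
      const products = client.db("agora").collection("products");

      for (const baseUrl of storeUrls) {
        const res = await fetch(`${baseUrl}/products.json?limit=250`);
        if (!res.ok) continue; // skip stores that block or 404

        const data = await res.json();
        for (const p of data.products) {
          await products.updateOne(
            { storeUrl: baseUrl, handle: p.handle },
            {
              $set: {
                title: p.title,
                vendor: p.vendor,
                productType: p.product_type,
                price: p.variants?.[0]?.price,
                imageUrl: p.images?.[0]?.src, // URL only, image files aren't stored
                url: `${baseUrl}/products/${p.handle}`,
              },
            },
            { upsert: true }
          );
        }
      }
      await client.close();
    }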

Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.

A few improvements that have helped so far:

- Having two separate search indexes, one for the 'brand' and one for the 'product' (a rough sketch of the two index definitions follows this list). There's a second public JSON file available on all Shopify stores with relevant store data at [Base URL]/meta.json. For example: https://wildfox.com/meta.json

- Removing the "tags" that are provided by store owners on Shopify. I believe these are placed for SEO reasons. These were 1 - 50 words / product so removing these reduced the data size we're dealing with. The tradeoff is that they can't be used to improve the search experience now.

Hope this helps. Still wrapping my head around all of this.




2.2k/mo right off the bat is pretty steep, especially if you're paying that while the search response reliably takes over 10 seconds.

Why would you shovel 1.5k into MongoDB's pockets right off the bat? Especially when ElasticSearch is much better suited to what you're trying to do?


Sounds like someone drank the Mongo kool-aid. You absolutely do not need Mongo, let alone Mongo Atlas. 25 million documents with ecommerce products is measly and should fit on a single 600 GB server.


Probably not even that - 25mil is nothing really. A normalised schema in an RDBMS would handle that without sweating.


You could run this entire stack (yes, even for 25 million products) using Kubernetes in a $40/month Linode + Elasticsearch + Cloudflare free plan.


If you're already on AWS, I recommend switching to postgres for now. For context, I have 3 RDS instances, each multi zone, with the biggest instance storing several billion records. My total bill for all 3 last month was $661.

Postgres has full text search, vector search, and jsonb. With jsonb you can store and index json documents like you would in Mongo.

- https://www.postgresql.org/docs/current/textsearch.html

- https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-rd...
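To make that concrete, here is a minimal sketch of jsonb storage plus full text search using the node-postgres ("pg") client. The table, column, and field names are illustrative, not a recommendation of an exact schema.

    const { Client } = require("pg");

    async function main() {
      const client = new Client({ connectionString: process.env.DATABASE_URL });
      await client.connect();

      // Store each scraped product as a jsonb document.
      await client.query(`
        CREATE TABLE IF NOT EXISTS products (
          id bigserial PRIMARY KEY,
          doc jsonb NOT NULL
        )`);

      // GIN index on the document for containment queries (@>) ...
      await client.query(
        `CREATE INDEX IF NOT EXISTS products_doc_gin ON products USING gin (doc)`);

      // ... and an expression index for full text search on the title field.
      await client.query(`
        CREATE INDEX IF NOT EXISTS products_title_fts ON products
        USING gin (to_tsvector('english', doc->>'title'))`);

      // Ranked full text match on the title field.
      const { rows } = await client.query(
        `SELECT doc->>'title' AS title
           FROM products
          WHERE to_tsvector('english', doc->>'title') @@ plainto_tsquery('english', $1)
          LIMIT 20`,
        ["wildfox hoodie"]
      );
      console.log(rows);
      await client.end();
    }

    main().catch(console.error);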


You can even do Elastic-level full text search in Postgres with pg_bm25 (disclaimer: I am one of the makers of pg_bm25). Postgres truly rules, agree on the rec :)


I have trouble seeing how this is possible.

$220 per instance gets you 8 GB of RAM, which is way, way below the index size if you are indexing billions of vectors.
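For a rough sense of scale (the dimensionality here is an assumption, not something stated above): 1 billion 768-dimensional float32 vectors is about 1e9 × 768 × 4 bytes ≈ 3 TB of raw vector data before any index overhead, versus 8 GB of RAM per instance.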


how big is the disk for the biggest instance?


Pretty small still, at 500 GB. It only stores hot data right now and a subset of what's important. Most of our data is in S3.


Disclaimer: I am building https://pricetracker.wtf

You may want to look at Hetzner, and cut your costs by about 90%.

Feel free to reach me, email in profile.


In your footer you have a lot of links like "kitchenaid price tracker" and "best buy price tracker". Have these links helped?


hey! this is cool, I take it you are based in the US?

How long have you been working on this?


On and off for a year, with more time allocated since June. Yes I am in California.


I’ll second the comments that $2k/month is alarmingly high, especially for the performance that you seem to be getting. When I shoved ~40M webpages into a stock ElasticSearch instance running on a 2013-era server I bought for $200 (on eBay), it handled the load when I hit the HN front page just fine. Either you’re being drastically overcharged or there’s something horribly inefficient in your setup that could probably be tweaked fairly easily to bring your prices down.


I'm biased, but I'd recommend exploring Typesense for search.

It's an open source alternative to Algolia + Pinecone, optimized for speed (since it's in-memory) and an out-of-the-box dev experience. E-commerce is also a very common use-case I see among our users.

Here's a live demo with 32M songs: https://songs-search.typesense.org/

Disclaimer: I work on Typesense.
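For anyone who wants to see what a query looks like, here's a minimal sketch with the typesense Node client. The host, API key, collection, and field names are placeholders, not a real setup.

    const Typesense = require("typesense");

    const client = new Typesense.Client({
      nodes: [{ host: "localhost", port: 8108, protocol: "http" }], // placeholder node
      apiKey: "xyz", // placeholder key
      connectionTimeoutSeconds: 2,
    });

    async function search(q) {
      return client
        .collections("products")
        .documents()
        .search({
          q,                         // the user's query
          query_by: "title,vendor",  // fields to search, assuming they exist
          per_page: 20,
        });
    }

    search("hoodie").then((r) => console.log(r.hits?.length)).catch(console.error);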


I can also highly recommend TypeSense and have no affiliation. You'll save a lot of money and get much faster results.


You’re spending $2k/mo to run this?? Holy hell.


> I'm currently not storing the image files, so that reduces the cost as well.

I wonder: if someone catches on and replaces all your image URLs with the fuzzy testicle egg cup[0], will that negatively impact your reputation?

0: http://i.imgur.com/32R3qLv.png


I index 40M paragraphs of legal text, bm25 and vector similarity search, at < 200ms query time, on a single $80/month Hetzner server. Email in profile if you’d like to talk.


>Mongo will run me about $1,500 / month at the current CPU level. AWS all in will be about $700. I'm currently not storing the image files, so that reduces the cost as well.

It will probably cost you just $100 to rent a server from Hetzner and do the same thing. I would also use Redis or another kind of cache to hit the DB less.
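As a sketch of the caching idea (cache-aside with node-redis v4; key naming and TTL are just illustrative assumptions):

    // Check Redis before hitting the database, and cache results briefly.
    const { createClient } = require("redis");

    const redis = createClient({ url: process.env.REDIS_URL });

    async function searchProducts(query, runDbSearch) {
      if (!redis.isOpen) await redis.connect(); // connect lazily on first use

      const key = `search:${query.toLowerCase()}`;
      const cached = await redis.get(key);
      if (cached) return JSON.parse(cached); // cache hit: skip the DB entirely

      const results = await runDbSearch(query); // whatever backs search (Mongo, ES, ...)
      await redis.set(key, JSON.stringify(results), { EX: 300 }); // cache for 5 minutes
      return results;
    }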


Take a look at TypeSense. Faster, better filtering, and much much cheaper if you’re going the cloud version


Sounds like you used an incorrect instance type/size on Atlas


> site called "Built With",

Do you have a link? And are they any good?



I specifically asked the author if he could add some extra info on Builtwith.

I can Google. But then I don't know if it's truly the site the author was talking about. And I certainly don't know his or her insights on that site.


Berkes wanted to do good by sharing a commission with the OP, in case he/she buys something at BuiltWith.

We all know how to Google. :)


Managed Elasticsearch could slash your cost by an order of magnitude, at least.


Oh... no... $1500/mo?


Yo, fuck Mongo, just use RDS or some DigitalOcean DB. Or really, just use OpenSearch/Elasticsearch, or even Typesense (don't bother with Raft, it's so broken) or Meilisearch.


We’ve interacted before on Twitter and GitHub, and I want to address your point about Raft in Typesense since you mention it explicitly:

I can confidently say that Raft in Typesense is NOT broken.

We run thousands of clusters on Typesense Cloud serving close to 2 Billion searches per month, reliably.

We have airlines using us, a few national retailers with 100s of physical stores in their POS systems, logistics companies for scheduling, food delivery apps, large entertainment sites, etc - collectively these are use cases where a downtime of even an hour could cause millions of dollars in loss. And we power these reliably on Typesense Cloud, using Raft.

For an n-node cluster, the Raft protocol only guarantees auto-recovery for a failure of up to (n-1)/2 nodes (e.g., a 3-node cluster auto-recovers from 1 failed node, a 5-node cluster from 2). Beyond that, manual intervention is needed. This is by design, to prevent a split-brain situation. It's not a Typesense thing, but a Raft protocol thing.



