
I believe you could drop 99 of those 100 petabytes and your average user wouldn't notice. The web is full of crap. I don't think you have to match Google's size from the start to be able to compete.



That was my thought as well. Especially if this could somehow run as a "personal" search engine, so I could fine-tune it to avoid crawling and indexing certain sites - I really don't need to see anything from CNN, HuffPo, etc. If I have that urge, I can always Google it.


>TeMPOraL wrote: I believe you could drop 99 of those 100 petabytes and your average user wouldn't notice.

A general-purpose search engine would still need those 99 petabytes of bad webpages: they're the negative examples that help machine-learning algorithms learn to classify new, unknown web content as good or bad.
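
For illustration, here's a minimal sketch of that idea (not anything Google actually runs): pages labeled "good" and "bad" train a classifier that can then score pages it has never seen. The corpus, labels, and scikit-learn pipeline are all made up for the example.

    # Sketch: train a text classifier on good and bad pages, then score new ones.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: page text paired with a quality label.
    pages = [
        "in-depth tutorial on building a search index from scratch",
        "original research with cited sources and reproducible benchmarks",
        "BUY CHEAP PILLS click here best deals limited offer",
        "ten celebrity photos you will not believe number seven",
    ]
    labels = ["good", "good", "bad", "bad"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(pages, labels)

    # Score a page the crawler has never seen before.
    print(model.predict(["a new blog post explaining search engine ranking"]))

The point is only that the "crap" half of the corpus is what teaches the model what crap looks like; throw it away and the classifier has nothing to contrast the good pages against.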

>if in some way this could run as a "personal" search engine. So I could fine tune it to avoid crawling and indexing certain sites

What you want sounds more like a "whitelist" of good sites to archive and a blacklist of sites to avoid. With the much smaller storage requirements of the whitelisted content, you could build a limited inverted index[1]. I agree that would be useful for a lot of personal uses, but it's not really a homemade version of Google: your method relies on post-hoc reasoning and curation, while Google's machine-learning algorithms can intelligently rank new content from websites that don't even exist yet.

[1]https://en.wikipedia.org/wiki/Inverted_index
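
For the curious, a tiny sketch of an inverted index over a hand-picked whitelist. The URLs and page text are made up, and a real version would at least tokenize and normalize properly:

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the set of document ids (here, URLs) that contain it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    # Hypothetical whitelisted pages, keyed by URL.
    docs = {
        "https://example.org/a": "building a personal search engine",
        "https://example.org/b": "inverted index basics for search engines",
    }

    index = build_inverted_index(docs)
    print(index["search"])    # both documents
    print(index["inverted"])  # only the second document

A query is then just a set intersection over the postings for each query term, which is perfectly feasible at personal-whitelist scale.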



