
I believe you could drop 99 of those 100 petabytes and your average user wouldn't notice. The web is full of crap. I don't think you have to match Google's size from the start to be able to compete.



That was my thought as well. Especially if this could somehow run as a "personal" search engine, so I could fine-tune it to avoid crawling and indexing certain sites - I really don't need to see anything from CNN, HuffPo, etc. If I have that urge, I can always Google it.


>TeMPOraL wrote: I believe you could drop 99 of those 100 petabytes and your average user wouldn't notice.

A general-purpose search engine would still need those 99 petabytes of bad webpages: they're the negative examples that help machine-learning algorithms learn to classify new, unknown web content as good or bad.
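
For illustration, here's a minimal sketch of that idea (not anything Google actually runs): pages labeled "good" and "bad" train a classifier that can then score pages it has never seen. The corpus, labels, and scikit-learn pipeline are all made up for the example.

    # Sketch: train a text classifier on good and bad pages, then score new ones.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training data: page text paired with a quality label.
    pages = [
        "in-depth tutorial on building a search index from scratch",
        "original research with cited sources and reproducible benchmarks",
        "BUY CHEAP PILLS click here best deals limited offer",
        "ten celebrity photos you will not believe number seven",
    ]
    labels = ["good", "good", "bad", "bad"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(pages, labels)

    # Score a page the crawler has never seen before.
    print(model.predict(["a new blog post explaining search engine ranking"]))

The point is only that the "crap" half of the corpus is what teaches the model what crap looks like; throw it away and the classifier has nothing to contrast the good pages against.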

>if in some way this could run as a "personal" search engine. So I could fine tune it to avoid crawling and indexing certain sites

What you want sounds more like a "whitelist" of good sites to archive and a blacklist of sites to avoid. With the much smaller storage requirements of the whitelisted content, you could build a limited inverted index[1]. I agree that would be useful for a lot of personal uses, but it's not really a homemade version of Google: your method relies on post-hoc reasoning and curation, while Google's machine-learning algorithms can intelligently rank new content from websites that don't even exist yet.

[1]https://en.wikipedia.org/wiki/Inverted_index
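
For the curious, a tiny sketch of an inverted index over a hand-picked whitelist. The URLs and page text are made up, and a real version would at least tokenize and normalize properly:

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the set of document ids (here, URLs) that contain it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    # Hypothetical whitelisted pages, keyed by URL.
    docs = {
        "https://example.org/a": "building a personal search engine",
        "https://example.org/b": "inverted index basics for search engines",
    }

    index = build_inverted_index(docs)
    print(index["search"])    # both documents
    print(index["inverted"])  # only the second document

A query is then just a set intersection over the postings for each query term, which is perfectly feasible at personal-whitelist scale.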



