But again, you could also argue that there would be fewer Linux exploits if it w...

TeMPOraL · on Oct 27, 2016

> If the core goal was simply to link to content then the incentive goes away, and so does the problem.

No, it doesn't, because the core goal of even a perfectly non-profit search engine is to sort content by relevance. That's ultimately a general-AI-complete problem that would require your computer to read your mind in order to solve it perfectly. So necessarily, a search engine is using some heuristics to approximate relevance, and those heuristics are open to gaming. Thus the cat-and-mouse game between search engines and SEO people.

There's way too much content to just link to all of it.

ksk · on Oct 27, 2016

The SEO people are only involved because google links to ads, or pages with ads. For the sake of argument lets say google opensources their engine. One could create a forked search portal that never linked to ads or pages with ads. The motivation to game such an engine would be greatly reduced IMO.

TeMPOraL · on Oct 27, 2016

I disagree. Outside the startup world, a lot of SEO is done simply to drive traffic to you so that you have chance to convert some of it into paying customers. All the small businesses that pay SEO companies to spam the shit out of the Internet don't do it for ad revenue, they do it just for organic traffic.

This motivation won't go away as long as people can use the Web to make money, so we're stuck with the pressure to game search results.

ksk · on Oct 27, 2016

Maybe I should have differentiated between link-farm type SEO and web marketing. The former is more likely to cause people to abuse the system because simply getting people to visit your website or click a link provides monetary compensation. If you drove traffic to your website I could make a hand-wavy free market argument that if your product wasn't good, it wouldn't sell and you wouldn't have money to pay someone to drive traffic to your website. I fully accept that its my own opinion.

jkaunisv1 · on Oct 27, 2016

Are there non-profit search engines?

Any domain experts who can speak to how hard/expensive it would be to run your own (a quick google shows me stuff like http://www.gigablast.com/)?

jasode · on Oct 27, 2016

Would you want the non-profit search engine to be on par with Google for quality? Google's storage just for the index of web pages is about 100 petabytes.[1][2] To compare, Wikipedia's storage is only ~50 TB[3] and it's mostly static content with no heavy cpu constantly processing the data. Relatively speaking, Wikipedia's operation is tiny and yet they're regularly running banner ads for donations to keep the lights on.

Copying an open-source search engine algorithm from github isn't enough to get a usuable search tool. And buying some 8 terabyte hard drives from Amazon to run a RAID setup in the garage won't store enough data to feed the algorithm.

[1]https://googleblog.blogspot.com/2010/06/our-new-search-index...

[2]https://www.quora.com/How-large-is-the-Google-search-index-a...

[3]https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Si...

TeMPOraL · on Oct 27, 2016

I believe you could drop 99 of those 100 petabytes and your average user wouldn't notice. The web is full of crap. I don't think you have to match Google's size from the start to be able to compete.

jkaunisv1 · on Oct 28, 2016

That was my thought as well. Especially if in some way this could run as a "personal" search engine. So I could fine tune it to avoid crawling and indexing certain sites - I really don't need to see anything from CNN, HuffPo, etc. If I have that urge, I can always Google it.

jasode · on Oct 31, 2016

>TeMPOraL wrote: I believe you could drop 99 of those 100 petabytes and your average user wouldn't notice.

A general purpose search engine would need that 99 petabytes of bad webpages to help machine learning algorithms classify the new and unknown web content as good or bad.

>if in some way this could run as a "personal" search engine. So I could fine tune it to avoid crawling and indexing certain sites

What you want sounds more like a "whitelist" of good sites to archive and a blacklist of sites to avoid. With the smaller storage requirements of the whitelist content, build a limited inverted index[1]. I agree that would be useful for a lot of personal uses but it's not really a homemade version of Google. Your method requires post-hoc reasoning and curation. Google's machine learning algorithm can make intelligent rankings on new content of new websites that don't exist yet.

[1]https://en.wikipedia.org/wiki/Inverted_index

jkaunisv1 · on Oct 27, 2016

Maybe not on par but close - I was expecting limitations like that, didn't realize it'd be in the petabytes! I wasn't saying or expecting that it could be done on the cheap. Just trying to think of ways to get away from the filter bubble and ad-based businesses.

I was just daydreaming, without first evaluating technical limits, that there could be a distributed search engine that runs on donated CPU/storage like those SETI and genome projects.

Thanks for the info.

posterboy · on Oct 27, 2016

But the info is misleading. The Internet Archive stored 18.5 Petabytes as of Aug. 14, yet it's a non profit

jasode · on Oct 27, 2016

The Internet Archive has less ~200 to 400 billion pages. Google has 60+ trillion pages.

The IA also doesn't spend money on constantly running map-reduce[1] jobs 24-by-7 which requires a datacenter of cpus and megawatts of electricity. By extension, they don't pay for a small army of programmers to write the ranking algorithms and tune the results.

So on their shoestring nonprofit budget of $12 million a year, you won't be able to get good results from search queries.

E.g. type "javascript" in the IA search box and the first 10 results are junk.[2] At the top is a Javascript Bible book from 2010 that's rated 3 out of 5 stars on Amazon.

Compare IA with the "javascript" results from google.com[3].

It will take a lot more money than $12 million a year[4] to upgrade IA to be similar quality to Google. The increased costs would very likely exceed the financial support of their most generous donors.

[1]https://en.wikipedia.org/wiki/Google_Panda

[2]https://archive.org/search.php?query=javascript

[3]https://www.google.com/search?q=javascript

[4]http://blog.archive.org/2015/12/26/the-internet-archive-tele...

posterboy · on Oct 27, 2016

Talk about feeding the troll ... Your argument is invalid, simply because google didn't become big because they had more money than others, but because the technology was an advantage.

Sure, expenses are a limiting factor along the way, but you use anecdotal evidence to make a point of that. Look at traffic instead, is google also single handedly stemming the whole infrastructure?

The question was about other search engines, non profit at that, and in doubt of that, about the possibility. Whether or not it would be expensive and whether or not the wikimedia foundation uses a lot of their money for meet ups and the like, there are possibly other limiting factors, that might ease the financial problem when overcome.

jasode · on Oct 27, 2016

>, but because the technology was an advantage.

Is "technology advantage" another way of saying their PageRank algorithm of mathematically iterating on a linear algebra problem was superior to other approaches from Lycos/Excite/Yahoo? Well, yes.

But that algorithm has to run on thousands of CPUs which require lots of electricity and expensive programmers to program it. The algorithm has to be expressed on real-world things that cost lots of money. Google got $25 million of VC money in 1999. 18 months later, they figured out Adwords and that brought in enough money to self-fund expansion (buy more computers and harddrives). The IPO in 2004 brought another $1.9 billion to pay for datacenters. The superior algorithm must combine with expensive massive hardware scale to deliver quality results.

>The question was about other search engines, non profit at that

If you read jkaunisv1's comment more carefully, he wanted the homemade search quality to be "close" to Google's quality. You can't achieve that on a nonprofit budget of $12 million a year. (See my links for "javascript" to get an idea of what $12 million buys you.) To be fair to IA, their main mission is archiving and not cutting-edge search algorithm quality.

jasode · on Oct 27, 2016

>With search engines, [...] If the core goal was simply to link to content then the incentive goes away, and so does the problem.

Why would the incentives go away for bad actors to game the ranking algorithms so that their bad content rises to the top of results?

Your chain of logic doesn't make sense. E.g. Since the core goal of SMTP is exchanging electronic mail -- it means the incentive for for advertising (SPAM) goes away and the problem solves itself?!?

Why would spammers and bad actors let the idealized goal of technology stop them from abusing it?

ksk · on Oct 27, 2016

>Why would the incentives go away for bad actors to game the ranking algorithms so that their bad content rises to the top of results?

Then you're talking about bad-actors who are not motivated strictly by financial gain, but by the successful spreading of their bad content. That is not really a problem specific to search engines. It applies to any medium that has the ability to disseminate information.

> E.g. Since the core goal of SMTP is exchanging electronic mail -- it means the incentive for for advertising (SPAM) goes away and the problem solves itself?!?

That is not comparing like to like. Unlike mail servers, google plays a very active role in choosing what kind of content it indexes. My point was that if they chose to simply index information and never link to ads or pages with ads, then the incentive to game the results is greatly reduced. (For e.g. - Google Scholar.)

jasode · on Oct 27, 2016

>bad-actors who are not motivated strictly by financial gain, but by the successful spreading of their bad content.

The "bad content" showing up on the top of page 1 of the search results is directly related to financial gain. That's why the SEO industry exists!

>if they chose to simply index information and never link to ads or pages with ads, then the incentive to game the results is greatly reduced.

The flaw in your logic is that it's the ads that's the primary motivation for bad actors to push their unwanted pages to the top.

Let's say I'm a bad author that wrote a bad book about "traveling to Paris". I would want my blog page about me and my book to be at the top of search results when you type "Paris" in the search box. It doesn't matter what ads are showing in the side panel. It also doesn't matter if ads didn't exist at all because every Google user would pay a $9.99/month subscription. Either way, I still want my webpage at the top so that some percentage of search users click on my (non-ad) link and buy my sightseeing book.

ksk · on Oct 27, 2016

>The flaw in your logic is that it's the ads that's the primary motivation for bad actors to push their unwanted pages to the top.

Yes, that is my opinion.

> I would want my blog page about me and my book to be at the top of search results when you type "Paris" in the search box. It doesn't matter what ads are showing in the side panel. It also doesn't matter if ads didn't exist at all because every Google user would pay a $9.99/month subscription. Either way, I still want my webpage at the top so that some percentage of search users click on my (non-ad) link and buy my sightseeing book.

Okay, but that is simply marketing. You are not assured of any monetary compensation if someone visits your website or clicks on a link. Its like 'buying' influence in a bookstore to make sure your book is displayed prominently. I was primarily thinking about the SEO surrounding link farms and other similarly abhorrent web sites.

jasode · on Oct 27, 2016

>Okay, but that is simply marketing. You are not assured of any monetary compensation

It doesn't matter that I'm not 100% assured of a sale.

What matters is that I have used up the finite space of pixels with a link to my bad webpage. Add in other bad actors like me gaming the ranking system and now the first page of results is full of spammy web pages. The "good" links such as a wikipedia article about Paris is completely pushed off the first page. If we get 0.01% of eyeballs to our "bad" links converted as sales, that's still better than zero.

As a user of Google and Bing search engines, I do not want any government to force them to publish their algorithms. It will make the search results worse. Keeping the algorithms and heuristics a secret is a valid way to fight abusive gaming of the system.

The opaque strategy for search rankings is not the same issue as a closed-source encryption algorithm with a backdoor.

ksk · on Oct 28, 2016

>If we get 0.01% of eyeballs to our "bad" links converted as sales, that's still better than zero.

Where did you get the 0.01% from ? That sounds like a rather high number.

>It will make the search results worse. Keeping the algorithms and heuristics a secret is a valid way to fight abusive gaming of the system.

I understand that this is your opinion. Thankfully, there is no evidence that this is true.

jasode · on Oct 31, 2016

>Where did you get the 0.01% from ? That sounds like a rather high number.

Why are you distracted by that 0.01%? Whether it's 0.001% or 0.00001%, it's still higher than zero. It still drives the incentives to push bad pages to the top of search results.

>, there is no evidence that this is true.

The evidence is Google's ongoing evolution of algorithms from the 1998 PageRank paper[1]... to Panda... to Penguin... to Hummingbird. All that constant rewriting is to stay ahead of the abusers gaming the search algorithms.

The link farms were created by spammers based on the information about Pagerank and using its link-weighting to the spammer's advantage. The public knowledge of how PageRank works allowed spammers to make search results worse. There are many examples[2] of of link spam that is not motivated by the "ads" in the right side of the page. What you call "web marketing" is often bad pages made visible by abusive SEO techniques.

Panda was a response to this gaming by analyzing extra signals to penalize link farms. Abusers continued to game Panda's revised algorithms and Google responded with Penguin.[3] Just stop and think deeply about why we can't just use Google's original 1998 PageRank algorithm unchanged in 2016.

At this point, making all of the algorithms and weights of Hummingbird public will only give the spammers the necessary information to make the search results worse for the rest of us.

[1]http://infolab.stanford.edu/pub/papers/google.pdf

[2]https://en.wikipedia.org/wiki/Google_penalty#Common_forms_of...

[3]http://searchengineland.com/google-launches-update-targeting...

ori_b · on Oct 28, 2016

If your belief is true, why does email spam exist? There is no way to way to tell if they were opened (especially with modern email clients that will not load external links), so there is no direct advertising happening.