Google doesn't recognise or penalise stolen content (pi-datametrics.com)
98 points by ollieglass on Nov 2, 2015 | 77 comments



I think this is pretty fair on Google's part. How could you possibly figure out who owned content?

What if I published a book, it was copy-pasted in blogs, and then later I put it somewhere crawlable by Google? You certainly can't just say "first time we saw it, that's the proper owner". It would either require a massive amount of manual QA to get right (and even then, there are going to be interminable copyright battles), or have a super high error rate.

I think Google's best value is letting proper content owners easily find violators via normal searches, and let them deal with them via takedown notices or the court system -- which is where it should be done, not in a pseudo-court run by a Google who does not want that responsibility.


So back when Blekko was a consumer search engine we could 100% figure out who owned content on sites we crawled often. And even when we didn't, we could guess correctly more often than not based on the domain registration dates (not to mention registry owners). That is because few people who rip off content rip off just one web site; they will rip off dozens of web sites and they will all share the same AdSense ids and the same domain registrar. This is easy stuff to spot when you crawl the web regularly.
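
A rough sketch of that kind of check, assuming a crawler has already extracted an ad publisher id and a registrar for each domain. The record layout, field names, and domains below are made up for illustration; this is not Blekko's actual pipeline.

```python
from collections import defaultdict

def cluster_by_shared_ids(sites):
    """Group domains that share both an ad publisher id and a registrar."""
    clusters = defaultdict(set)
    for site in sites:
        key = (site["adsense_id"], site["registrar"])
        clusters[key].add(site["domain"])
    # A lone domain tells us nothing; keep only groups of two or more.
    return {key: sorted(domains) for key, domains in clusters.items()
            if len(domains) > 1}

# Hypothetical crawl records (all values are assumptions).
sites = [
    {"domain": "original-blog.example", "adsense_id": "pub-111", "registrar": "RegA"},
    {"domain": "copy-one.example", "adsense_id": "pub-999", "registrar": "RegB"},
    {"domain": "copy-two.example", "adsense_id": "pub-999", "registrar": "RegB"},
    {"domain": "copy-three.example", "adsense_id": "pub-999", "registrar": "RegB"},
]

suspects = cluster_by_shared_ids(sites)
```

A shared id is only a signal, not proof: a legitimate publisher can also run several sites under one account, so a cluster like this would still need the duplicate-content evidence alongside it.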

I suspect that Google simply doesn't care. They get Ad revenue regardless and in their laissez-faire editorial position it doesn't matter. What are you going to do, use another search engine?


Or, more likely, they can't get involved for legal reasons. If they took steps to block the easy stuff, an arms race would ensue, and the content providers would never be satisfied with the performance being provided for free by Google. The content providers would always demand stricter enforcement, and could threaten to sue for copyright infringement regardless of merit.


I think if that were the case they would not have pushed out the "Panda" updates which penalized content farms so heavily. If their past behavior (with content farms and other "low value" content sites) is a guide they will not do anything until enough people complain about it.

In the meantime it isn't even Google's content so it's not a hosting issue, they are just the "neutral" third party providing their 10 blue links (oh and supplying the advertising engine those sites are using)


Do you know if any search engine is actively filtering for this?


Sadly no. There are only a few actual derived general search indexes (English language search) in the world: Microsoft's, Google's, Baidu's, and Yandex's. They are expensive to build and maintain and the only way to monetize them requires driving search traffic your way. Google is paying $4B/year to third parties to send search traffic their way.

My guess, having been at both Google and Blekko, is that "whitelisted" search will be the next wave in the industry. For those old enough to remember Yahoo!'s original "directory" model, once Yahoo!'s contract with Microsoft is up one could hope they rebuild their search team and technology into something with a strong editorial bias for "quality" content.


Did you get that $4 billion number from their quarterly results? Does that include things like their payments to Opera and Apple? Does it include search rev share deals with entities like AOL and Ask (& soon to be Yahoo)? Does it include paying for Chrome distribution bundled with Flash security updates & such? I have never seen the overall numbers broken down in terms of what percent goes where on the different sorts of syndication deals.

Three things which would be major issues for Yahoo! on that sort of search would perhaps be, first, that they themselves rely so heavily on content syndication to power their various verticals; second, that they keep losing search market share (especially as more search happens on mobile devices and Google has mobile locked down with their Android contracts); and third, that they also screwed up their old directory before they moved it to Yahoo! small business as part of the Alibaba share spinco.

I also don't see how Yahoo would effectively differentiate their search engine enough to be able to (profitably) buy share at prices set by Google, particularly if they over-promote their internal results & rely on a smaller search index.


Prior to restructuring their reporting, Google reported paid distribution as a cost. I left in 2010 and started tracking the number in Q1 2011. It was $337M for the quarter. By Q4 of 2014 that number had ballooned to $968M for the quarter. In 2015 they changed the way they reported this number, making future comparisons problematic.
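
For what it's worth, the arithmetic on those two quarterly figures can be checked quickly. Multiplying a single quarter by four is only a rough run-rate assumption, not a reported annual number.

```python
# Quarterly paid-distribution cost in $M, as quoted in the comment.
q1_2011 = 337
q4_2014 = 968

# How many times larger the Q4 2014 figure is than the Q1 2011 figure.
growth_multiple = q4_2014 / q1_2011

# Rough annual run rate, assuming four quarters at the Q4 2014 level.
annual_run_rate = q4_2014 * 4  # 3872, i.e. roughly $3.9B

print(f"{growth_multiple:.2f}x, ~${annual_run_rate / 1000:.1f}B/year")
```

That run rate is in the same range as the "$4B/year" figure mentioned upthread.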

I expect it does include fees paid to Apple so that Apple would send search traffic to Google, and fees paid to browser vendors.

Our experience as a search results provider was that there was demand for a more 'functional' search capability (not casual searching). Many of the techniques we used have been adopted by Microsoft in their Bing engine, which has improved both their recall and quality with respect to Google results on highly contested searches.

I certainly agree that Yahoo! has made a number of missteps with their search technology. I talked with them once (post Marissa's arrival) and in many ways they were as confused as ever about how search engines generate value for the parent company, but such things are rarely permanent.


Thanks for sharing that :)

One interesting bit from the most recent IAC investor conference call: they mentioned that their search deal with Google was renewed for another 4 years & that the rev share on mobile was lower than it was in the past. An analyst asking a question mentioned that both Google and Yahoo! were lowering revenue share on mobile.


> "whitelisted" search will be the next wave in the industry

> ... strong editorial bias ...

Interesting. Care to elaborate?

When you use words like 'whitelisted' and 'editorial', I imagine humans adding something to a database one by one. But the volume of useful pages (and the number of sites) is really large now, so I guess that's not what you mean.

One thing I like about search today is that it's almost comprehensive. If I know something exists on the (open) web, I can usually find it with a few searches, even if it's very recent or obscure. I don't want to go back to the days when I browsed gopher directories, or even to the days when finding good quality content meant a hierarchical journey from a directory to a site, to a site map, to an individual page.


Also all of this is assuming that any duplicated content is inherently stolen, when it could in fact be public domain, fair use, legitimately licensed, distributed under Creative Commons, etc.


True, but in these cases I would still want original/canonical/fastest/best source first while the others are probably only valuable as backups.


People have been complaining about this problem for a LONG time.

Duplicate removal is essential for making a web search engine that works. For instance, together with a CS research group, I built a search engine for a major university library that had more than 80 web sites. We found huge amounts of duplicate content produced by various mechanisms (for instance, multiple people posted the same stuff to the web.) If your ranking is content-based, all of the duplicate documents are going to rank the same and form a "plug" that excludes other documents.
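
To make the "plug" problem concrete, here is a minimal sketch of one common way to drop near-duplicates before ranking: compare documents by the Jaccard similarity of their word shingles. This is a generic textbook technique, not necessarily what we used at the library; the threshold, shingle size, and sample documents are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group."""
    kept = []
    for doc_id, text in docs:
        sh = shingles(text)
        if all(jaccard(sh, kept_sh) < threshold for _, kept_sh in kept):
            kept.append((doc_id, sh))
    return [doc_id for doc_id, _ in kept]

# Hypothetical pages: "b" is a verbatim repost of "a".
docs = [
    ("a", "the library opens at nine every weekday morning"),
    ("b", "the library opens at nine every weekday morning"),
    ("c", "parking permits are sold at the front desk"),
]
unique_ids = dedupe(docs)
```

The pairwise comparison is quadratic, so this only works for small collections; at web scale one would need something like MinHash or another sketching scheme instead.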

It has long (post 2006) been a common story that "I wrote a blog post but somebody else ranks for it." For instance, I made a blog post that got a huge amount of traffic in the day, but right now you search for it and you find a presentation from some fresher at Oracle that is based on those ideas.

There are many factors that make this hard to control and these include: (1) for one "real" origin there are probably ten or a hundred fakes, so if you are picking at random you strike out -- you have to not only outrank one fake you have to outrank all the fakes, (2) freshness... copies are fresher than the original, also they can be updated years later, (3) also the bad guys think a lot more seriously about indexation, PageRank, and other variables they control than do most content creators.


Even if you don't care about trying to identify the original source of some piece of content, it seems like the content farm site which is plagiarizing is more likely to be a lower quality site than the original content producer.

The behavior does seem weird in any case, like there is a certain slot for a given piece of content, and Google is swapping different domains in and out to fill that slot. It seems like Google is actually trying to identify the original content, failing, and then actually inadvertently penalizing the original producer.


Well, Google has already indexed a new article x. When article y appears, and Google sees that y is an almost verbatim repeat of x, it shouldn't be that hard to figure out that article x is the original, should it? Especially if they both have time/date stamps....
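
A minimal sketch of that first-seen heuristic, with made-up URLs and crawl dates (everything here is an assumption for illustration, not how Google works). It also shows the failure case raised earlier in the thread: if the real author only becomes crawlable after the copies, the earliest timestamp points at the wrong site.

```python
from datetime import date

def guess_original(copies):
    """Return the URL of the copy the crawler saw first."""
    return min(copies, key=lambda c: c["first_seen"])["url"]

# Case 1: the original was crawled before the scraper copied it.
blog_case = [
    {"url": "author-blog.example/post", "first_seen": date(2015, 10, 1)},
    {"url": "scraper.example/post", "first_seen": date(2015, 10, 9)},
]

# Case 2: a book was copy-pasted onto blogs before the author put it online.
book_case = [
    {"url": "scraper.example/book", "first_seen": date(2015, 3, 2)},
    {"url": "author-site.example/book", "first_seen": date(2015, 8, 20)},
]

blog_guess = guess_original(blog_case)  # the author's blog
book_guess = guess_original(book_case)  # the scraper, which is wrong
```

On-page time/date stamps do not fix the second case either, since whoever publishes the page controls what date it displays.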


One man's stolen content is another man's mirror. There are countless times where original content is region-blocked or behind a paywall, or expired, but accessible via "stolen" links.


While most of what you say is true, they could at the very least reject AdSense applicants based on how often they copy-paste content from established publishers. I’m sure they have the means to figure something like that out.


I was huge into SEO for a few years. I try to stay out of it now, but it's worth noting that this is almost certainly due to the current algorithm's obsession with "freshness." The weaker site is ranking higher with the stolen content because their site was updated more recently. Steal some back and I bet they swap ranks again.

Also, the combination of the PageRank algorithm and normal user behavior typically helps Google to understand who was first and who deserves to rank higher. That is, most people don't plagiarize content, they quote it and then cite the source, which (thanks to PageRank) tends to rank the original better than sites which have plagiarized it.
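
A toy illustration of that effect, using a plain power-iteration PageRank on a tiny made-up link graph. The page names, damping factor, and iteration count are assumptions, and real ranking uses far more signals than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over a dict of page -> outlinks."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Three bloggers quote the article and link to the source;
# nobody links to the plagiarized copy.
links = {
    "blog-a": ["original"],
    "blog-b": ["original"],
    "blog-c": ["original"],
    "original": [],
    "copy": [],
}
ranks = pagerank(links)
```

In this graph the original ends up with a higher score than the copy purely because of the citations, which is the mechanism described above; it says nothing about what happens once freshness is mixed in.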


> the current algorithm's obsession with "freshness."

Which is how Google makes blogspam such a good business to be in, even if your content is inferior to the post you used for "research".


Most of the spam I see in the wild these days is indeed (established) dropped domains which were picked up and then loaded with thousands of pages of "fresh" spun content, with an incestuous backlink profile if any. So indeed 'blogspam'. Everything old is new again; it feels just like twelve years ago. Soon people will be keyword stuffing in a font the same color as the background...

But Google certainly isn't intending to make blogspam a good business to be in, and I'd argue that they aren't; over the past four years Demand Media's stock price has fallen from $400/share to $4, and the general marketplace for commoditized SEO services has shrunk by a similar degree over the same period. 19 out of every 20 SEOs who were active five years ago have thrown in the towel... just check Alexa graphs for the top SEO forums.

The SERPs are clean these days. Google has done an amazing job every year for at least thirteen years now of improving them constantly. The new wave of spam is social. In practice this means Buzzfeed writers stealing user-produced content from AskReddit threads and it ending up polluting my Facebook feed to the point that I can't even find any good counterfeit Raybans.


"The SERPs are clean these days."

Here's an alternate take on that http://www.johnon.com/1075/bullish-on-seo-rankbrain-vs-seobr...


That's fascinating because I really don't understand it. Maybe I'm just out of touch with SEO, but things like this escape me completely:

"This is because SEOs follow and influence the intent of searchers in the marketplace, while Google’s algorithm (and AI) merely monetizes it."

Where does the extra monetization on page 1 results come from? Unless he's implying that Google provides bad search results so that people will click the ads instead.....


There are numerous ways to interpret that. At a base level, one could look at how the mobile search results are sometimes a screen full of ads, or how in some verticals they are a screen full of ads followed by yet another screen full of ads.

And then there is the knowledge graph & other flavors of scrape-n-displace, which is largely content recycled from elsewhere, given prominent positioning not based on merit or editorial quality, but based on who the publisher (or recycler) is.

Another parallel trend would be the confirmation bias / brand bias factors promoting older and staler sites. Or simplified "take" articles in the mainstream media rather than the original source articles on niche hobbyist blogs and forums or such.

And in taking broad sets of new niche intents and trying to guide those streams of users back down well worn paths. For example, sometimes when you want to find a particular news story about a broad & well-known web platform like Apple, Amazon, Facebook, or Google it can be hard to find sites other than the official site. And on some other longtail queries Google rewrites what is being searched for in a way that brings up some results that don't match the true searcher intent. Probably the best example I can come up with on this front is say you wanted a pair of shoes of a specific brand, size, width, and model number. If they are not the most recent and most heavily marketed versions it can be tough. Auto-generated internal search pages on trusted brand sites rank well, while a small retailer carrying that specific shoe might be penalized by Panda.


> the current algorithm's obsession with "freshness."

That explains why I've noticed some older sites which are still around, and have plenty of detailed technical information, seem to have disappeared from the search results. Somewhat sad that the "newer is better" mentality appears to have taken over completely... if I really wanted the newest things I'd look at Google News.


I guess the problem is, there are some technical fields where old means useless. If I'm googling for Javascript libraries, hardware recommendations, or a fix to a package conflict in Ubuntu, I don't want something from 2010.


This came up before on YC.[1] Google does have a system to detect provenance, but you have to report your changes to Google as an RSS feed.[2] Google hasn't updated that page since 2010, and it may no longer do anything.

[1] https://news.ycombinator.com/item?id=10103545 [2] https://pubsubhubbub.appspot.com/


And they shouldn't. That's not their job.


Their job is, however, to direct people to the most relevant pages.


What is their job? I thought it was as a search engine. So, let me ask -- when people steal blog content, what are their motives for doing so? Is it to deliver that content to you, the reader? Or is it to get Google hits? How well are they preserving links, illustrations, reader comments (which are a disaster a lot of places but not all of them), an archive of other work by the same author that may be of interest? How often are they slipping undesirable things (ads that lead to sites that offer malware, for instance) alongside the content they're stealing?

If Google wants to be the best search engine possible, returning the original result for an article relevant to the user's query is a better result than returning some second-hand copy littered with low-quality ad junk. And if that's not Google's job, then let me know whose job it is and I'll start using them instead.


Just as it is not the job of the street vendor to know where these Rolexes come from.


I don't know if you're being sarcastic, but it is illegal to sell stolen or fake merchandise in the United States. Anyone selling fake Rolexes is committing a crime and could also be sued.

Saying that Google is "selling" stolen content isn't that clear, though. Yes, they're selling ads on search results, but wouldn't they get the same ad revenue regardless of where those links pointed?

It's easier to make the case with AdSense, where Google literally profits directly from stolen content.


> it is illegal to sell stolen or fake merchandise in the United States. Anyone selling fake Rolexes is committing a crime and could also be sued

Huh? Now I'm worried. Are you telling me that the Rolex watch I paid $30 for, that I bought from a street vendor near Times Square, might be fake? Oh no, the horror! /sarcasm

I don't think that Rolex is too worried about this. Nobody would mistake a $30 watch for a real Rolex. And, give it credit, my fake Rolex worked for a year or so. It probably just needs a new battery.

Besides, you can't sue a street vendor. They're what's known as "judgement proof".[1]

And as to police action against them, the de Blasio administration seems to have adopted a laissez-faire attitude about all this stuff. If they're willing to allow squeegee men to operate with impunity, they certainly won't care about novelty watches being peddled.

[1] https://en.wikipedia.org/wiki/Judgement_proof


This may be pedantic, but is stolen the right word to use?

I think plagiarized is more accurate.


Plagiarism can also apply to copying content without using the exact same wording. In my mind, "stolen" means copying verbatim.

Plagiarism comes from a word meaning "kidnapping" though, so the tone of both words is pretty similar.


Stolen implies that the original owner no longer has access to the data due to the actions of the perpetrator.


For physical property, yes. For intellectual property, it can still be stolen even if the original owner still has a copy.

e.g. The Soviet spies stole the plans for the hydrogen bomb.


No, for intellectual property the word "stolen" is not appropriate.

(Well, unless you're talking about getting the courts to tell the original owner it's yours instead. Which has been at least attempted a few times.)


It is semantically appropriate for the word steal. I think you're confusing it with the criminal implications.


Ah, so you mean the actual blueprints? And the poor Americans had no copies? How stupid of them!


So... copied.


The original owner no longer has access to the revenue stream so yes, the implication is correct.


Then, say "potential revenue" was stolen, not content.



A different angle is to look at what is accepted in a court of law. Laws do not consider copying/infringement to be stealing. I know some people who say gunna (I was gunna do it). However, that doesn't make it correct. So, usage is not a measure of validity either. Most online dictionaries exclude steal/theft being linked to infringement/copyright.


EDIT: Look at OED's additions list from June 2015, specifically under 'g': http://public.oed.com/the-oed-today/recent-updates-to-the-oe...

'gunna'. Is it still wrong?

--original comment--

Common usage does eventually lead to validity, though. Language is not static and evolves through usage. Besides, how are you going to measure validity? Is the OED the sole arbiter of what's "correct"? Common usage and mutual understanding is a great way to determine what's "correct" in a language.

I won't debate the legal usage of the word, as there was no mention of legal interpretation by the courts in the previous comments. However you are correct from a legal perspective where words have specific meanings.


So now Merriam-Webster has become the Twelfth Circuit? This is the lamest jurisdiction shopping I've seen in a while.


What a rubbish article. It doesn't fly for a second under copyright law. Google is entirely within their rights doing what they're doing. The onus isn't on Google to detect the infringing content.

For anyone interested in copyright and legal issues, I'd recommend checking out techdirt.com. They have a great starter section at https://www.techdirt.com/blog/?tag=techdirt+feature, and they cover legal, copyright, patent, surveillance and all sorts of related topics. High quality journalism.


Is this any better/worse than Facebook actively trying to profit and win over users when people or organizations copy / upload / soak up views for material they did not create and don't have the rights to use? Because that's a hot-point of discussion in some creative circles as well.


Facebook's freebooting is pretty terrible. But this can destroy entire websites. The title is misleading. Google isn't just not punishing thieves, it's heavily punishing the originals. They dropped from 20th result, to 100+, because someone stole their content.


Yikes! That is much worse, at least based on your note. Do you think this is an area where the EFF could litigate on behalf of the original creators in a fraud context? Just curious, and also grateful to not be dealing with such a horrible prospect.


Recognizing "stolen" content autonomously can only pivot on knowing when something was first published or visible to Google, which is a pretty dubious measurement.


But the weird thing is the stolen content, even when it's on a crappier site with no shares etc., knocking the original out of its slot. I.e., something (probably freshness) is causing the stolen content on the crappy site to be higher-ranked than the original content on the strong site. That seems avoidable (and in Google's best interest).


The title should more accurately say Google search. Many parts of Google, for instance YouTube, definitely do penalize stolen content.


I have had a site that had UGC spam promoting dodgy TV streams get hit with a penalty.


Advertisement post?


Yes but a very interesting one.


This is not stealing, and even if it is illegal that is a bad way to put it. Also, google's service is primarily to the searcher so this isn't a huge issue for them.


> google's service is primarily to the searcher

You are very wrong. For many years, nearly 100% of Google's revenue was from AdSense.

What if someone spends days writing an article and posts it on his blog. Then, someone else copies and pastes it onto BuzzFeed, which becomes the top search result for that topic.

BuzzFeed is making money that the same blogger would have made from his own content. Now, also assume Google serves ads to BuzzFeed, but it does not serve ads to the blogger. Google has a financial interest in ignoring the provenance of the content in this case.

Is all of that ethically acceptable?


> You are very wrong. For many years, nearly 100% of Google's revenue was from AdSense.

That's incredibly untrue. A substantial portion of Google's revenue has always been and continues to be from first-party AdWords ads.

The fact that you're using BuzzFeed as an example, a firm which emphatically does not use display ads, shows how little you know about this.


> For many years, nearly 100% of Google's revenue was from AdSense

Google has always gotten a very large majority of its advertising revenue from ads on its own sites, not on third party sites.


It's downright tragic to ask for more rules.


There is no such thing as "Stolen" content.


If that's true, there's also no such thing as "stealing" at all.

Consider a novelist who works for 10 years on her novel. A hacker steals the document from her computer and publishes it online under his own name. He makes $100M.

Is it wrong for the novelist to feel like someone stole from her? What word would you use instead?


How are people still arguing about this?

Saying that someone is "stealing" when they infringe copyright is like saying someone is "killing you" when they present convincing arguments against your cause. It isn't literally stealing or killing, it's an exaggeration made for emphasis.

The reason there is so much contention is that a) the same language has been extremely common among hysterical content industry lobbyists who insist that it is literally stealing, and b) stealing and copyright infringement are both unlawful (and therefore more easily confused) even though there remains a meaningful distinction between stealing and copying.

But that distinction is very important in practice because we can't treat stealing and infringement the same. If you don't like someone's speech you can't be allowed to steal any of their webservers but you have to be allowed to copy some of their work in order to effectively criticize them.


> What word would you use instead?

Infringing. (duh)


I can't tell if you're joking, but "infringe" means "to violate" which implies that there is a law or agreement that's being broken. That makes it sound like you agree with the idea that this is stealing.


I absolutely do not agree with the idea that this is stealing. How can it be stealing, when the owner still has the thing that was supposedly stolen?

Different circumstances, different terminology. The correct terminology (see US Title 17 or CDPA 1988) is "infringing". Anyone who insists on using the word "stolen" is signalling their ignorance of the first, most basic fact of copyright law.


In my example, the specific crime may not have been stealing, but there was revenue stolen.


> but there was revenue stolen.

1.) If the item is being given away for free, there can still be infringement.

2.) If a person would never purchase an item at the available price (due to the law of supply and demand for example), that person might still infringe. No revenue was lost or gained since the transaction would never have completed at the existing price.

In either of those cases, no revenue was "stolen", but infringement still occurred. These are some of the many reasons that stealing isn't a good way to describe copyright infringement.


So it is "stealing" to produce a superior, cheaper, but otherwise virtually identical product or service that sells better and displaces a competitor's revenue? Strange, I thought that was the whole basis of market economics.


The revenue was lost. Look at legal web sites. There is a specific vocabulary. The language you are using is from what I call "McDonald's Journalism" sites, which have a vested interest in vilifying anyone who infringes. By politicising the language, these sites use emotive language to sway your views. I'm sure there are articles on this. It's similar to yellow journalism.


There are all kinds of violations that are not stealing.

If I point a gun to your face and take your money, that's a law being broken, but taking that money is not stealing, it's robbery.

If I threaten to expose some dirty secrets and demand money or things from you, that's blackmail but not stealing.

If I take your textual content and re-publish it under my own name, then that's an infringing use but again, not stealing.


The law you're looking for is ~copyright~. The fact that the work isn't actually stolen is the keystone in how you can take the copyright case to trial. It's pretty hard to prove infringement if you have no record of the original work (i.e. it's been stolen).


Stealing means the victim doesn't have the stolen item anymore.


The book has not yet been published, so this is stealing. But once you put something up on the internet, it is officially available to everyone. Doing stuff with public information is fine IMO. Same as analyzing tweet data (tweets are public).


What if the book has been published in paper form? The book is public as long as you pay for it.

Since internet isn't free, you're still paying for content. At what point are you paying "enough" that the information isn't public anymore?

Are you saying no one should monetize their content using ads unless they're willing to allow anyone else to do that as well?


It is worth pointing out just how pissed off Google engineers were publicly when they felt Bing was copying their search results.

https://googleblog.blogspot.com/2011/02/microsofts-bing-uses...

http://searchengineland.com/google-bing-is-cheating-copying-...



