Hacker News
Ask HN: Legality of Web Scraping User-Submitted Content?
13 points by brandon272 on Aug 6, 2008 | 47 comments
Is it legal to scrape another website for user-submitted content? For example, if there was a news site that got its news solely from users submitting their own personally written articles and stories, who owns that content? Does the end user relinquish ownership once they submit that content?

Is it legal for another company to "scrape" that content and use it on their site, only removing it if the user who submitted it in the first place asks them to, either directly or via legal means?

Thoughts and insight would be appreciated!




It is most likely not legal.

The user owns copyright to the article or story that he wrote. This ownership of copyright gives the user the right to decide how it is distributed. For you to use the material legally, you must get permission of the copyright holder. The user may be the copyright holder, or the site may be if the user transfers the ownership. Either way, someone owns the copyright and it is not you. You need permission.

There are two defenses to copyright infringement: 1) fair use, and 2) parody.

Parody probably doesn't fit here. So that means you need to make a case for fair use.

Fair use has 4 elements:

1. The purpose of the use (commercial vs. non-commercial/educational) - if you are going to make money on this, fair use is out.

2. Nature of the copyrighted work - This doesn't really apply here, so I won't go into the lengthy explanation.

3. Amount or portion used in relation to the whole - Did you extract a quote? That is probably ok. Did you copy the entire article? Probably not ok.

4. Effect upon the market - if your site harms the market of the other site, no fair use.


It is not true that commercial purpose precludes a fair use defense. From Wikipedia:

"While commercial copying for profit work may make it harder to qualify as fair use, it does not make it impossible."

see: http://en.wikipedia.org/wiki/Fair_use


In this case it does.

The use in the case you are referring to had several extra elements which prompted the court to not find copyright infringement.

2 Live Crew was fair use because their work was transformative. They took a beat and changed words to a Roy Orbison song. They did not blindly copy, as the question asker will be doing. They took the score, not the words.

The poster is taking the story, not quotes. His site is not transformative, and therefore his commercial use is not fair use.


And, they lost the Van Halen lawsuit.


One of the most dangerous things you can do is rely on non-lawyers for legal advice. Ask an attorney.


Meh, if everyone listened to their lawyer we wouldn't have YouTube or BitTorrent :)


I wouldn't suggest you listen to them blindly, but you're crazy if you think YouTube didn't ask actual attorneys for a risk assessment (their VCs would have required that). They certainly didn't ask a bunch of hackers.


insert JP Morgan quote


For those who've never heard of this gem :)

Well, I don't know as I want a lawyer to tell me what I cannot do. I hire him to tell how to do what I want to do. ~ J. P. Morgan


haha :)


Attorneys are expensive, Hacker News is free.


Attorneys know what they're talking about where law is concerned, hackers don't but often don't realize that. Asking a hacker for legal advice is just as bad as asking a lawyer for programming advice. Actually it's probably worse, since nobody ever ended up in jail for a coding error.


How do you know some readers here don't read Supreme Court decisions because they're interested in law, even if they never went to law school? How do you know your lawyer doesn't hack on Arc on weekends? :)

Seriously, though: of course the person asking the question should retain a lawyer for definitive guidance. However, asking around to get a feel for the opinions of intelligent people makes sense, too. It might, at the very least, clarify the questions, and it will help detect if the lawyer's words make sense. While attorneys are valuable, they are also fallible, and their advice should be as critically considered as any other.


Well, an attorney's opinion on the law is certainly not infallible, but it should be considered a lot less critically than a hacker's.

The danger is that misinformation is often worse than no information at all, and asking a bunch of non-attorneys for legal advice is sure to net you a ton of it.


Oh, but we wish they did sometimes :)


This should seriously be in the FAQ somewhere - We've seen a lot of it lately, and it's kind of surprising the ideas that some otherwise very bright people have when it comes to the law.


Absolutely, it's ridiculous to pose legal questions here. If I were running the place I'd probably remove them. It's actually dangerous to both the poster (whose business could be ruined by misinformation) and those who respond.


That's true.

But some answers are so blazingly obvious or common knowledge that it's worth asking. It might at least speed up a conversation with a lawyer saving many hundreds of $s.


The real question is usually not "Is this legal" but rather "Will this get us sued". We do many "illegal" things like jaywalking all the time, but everyone with sense knows which things are truly illegal.


Technically, you own whatever you write. So I own the copyright on this message, unless I granted the copyright to YC News when I signed up (I don't remember). Assuming that I did not, then I retain copyright, and you have to get the ok from me, individually.

That's the theory anyway. Talk to a lawyer.


Check to see if the site that you're considering scraping has any sort of "terms and conditions", "terms of use", "legals", etc page. If it does, read that page in detail. If it doesn't, ask the site for permission to scrape.

If the site doesn't have one of those pages and you don't ask for permission, you're not only putting yourself in danger of legal action, but you're also depending on a data source that isn't reliable.

Remember, it'd only be a matter of time before they notice you scraping, and take measures to stop it.


Well, there's the real world answer, and the legal answer. Not too sure about the legal answer; that's always murky. Some things are copyrighted, some aren't. You can't just take someone's story, or art, and use it on your own site without permission. Usually the creator has all rights until they are given away. Even if they have uploaded it to another site, that doesn't mean they have given you permission though. On the other hand, some information can't be controlled. If Bob says it's 70 degrees in San Francisco right now, you can totally say it's 70 degrees. If the Mets won, you're welcome to say the Mets won. The Drudge Report does nothing except report headlines.

In the real world, however, people steal information all the time. It's not polite, though. Usually people ask for attribution. I don't think a policy of only removing it when someone asks you to would be polite. That's like stealing something when no one is home and leaving a note saying you'll return it if they ask you to.

Also, pure scraping, even of non-copyrighted information can get you into trouble if the other person had paid for that information, like a news site. They pay for their news. Scraping it and making your own news site (with full content, not just headlines) is illegal.

So the short answer is that the original creator still owns the content, and no you probably can't have it.


I guess my followup with that would be this:

What if the user is submitting content that they would very clearly want re-distributed? I guess my question with that is, if you take the end-user who is submitting the content out of the equation, does the site that is being scraped from have any kind of leg to stand on if they don't want you scraping their content (excluding technical means that they might implement), assuming that the site being scraped from does not force their end-user who is submitting the content to agree that the site that they are submitting the content to "owns" the content, once it is submitted?


The MySpace suicide case really muddles this and makes it a problem. If the site you want to scrape from has a TOS that says you can't re-use the info (and it most likely would), then re-using the info would be a violation of the terms of service. Basically, in the MySpace case the feds are trying to make it a crime to violate a website's terms of service.

But like someone else said, if someone wants their content distributed, the user is not going to give you any trouble, and if they do, you're very protected if you take the offending material down right away.

The website you're stealing from does have legal measures they can take, especially if you're directly competing with them. I think someone else mentioned that they could also just reconfigure their website to mess up your scrape, which is probably what they'll keep doing. They'll also likely publicize your bad business practices and you'll end up with a horrible reputation. Nobody likes a copier. It's like people who steal designs. They hardly ever outperform the website they're copying.


In the legal world, whenever text is set in fixed form, it is automatically copyrighted by its author, whether they claim it or not, unless they specifically say otherwise. "Specifically say otherwise" probably includes anything they might have agreed to in the TOS for a particular site, although the legal waters there are largely untested. The only exemption to this auto-copyright is statistics, phone numbers, or other non-copyrightable content (see CBC vs. MLB).

IANAL, but depending on the nature of your service and specifically how the content is collected, you may qualify for DMCA safe harbor protections. This means that if you remove things in a timely manner upon request nobody can sue you. This is how Google caches the whole internet without getting sued.

That's all legal mumbo-jumbo. The real world answer is that some people will get mad regardless of the law, so take their content down and apologize. Follow robots.txt guidelines. Don't post takedown replies a la PirateBay. Generally act sane. If you do all of the above you'll probably be ok.
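For what it's worth, the robots.txt part is easy to check mechanically. Here is a rough sketch in Python using the standard library's `urllib.robotparser`. The robots.txt contents, the bot name `example-bot`, and the example.com URLs are all made up for illustration, and passing this check only means the site has not opted out of crawling; it says nothing about copyright.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text so the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: every crawler is asked to stay out of /stories/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /stories/
"""

if __name__ == "__main__":
    for url in ("http://example.com/stories/123", "http://example.com/about"):
        allowed = may_fetch(EXAMPLE_ROBOTS, "example-bot", url)
        print(url, "allowed" if allowed else "disallowed")
```

In a real crawler you would probably point the parser at the live file with `set_url()` and `read()` instead of a string.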


DMCA safe harbor applies when you are a service provider.

Examples are an ISP that merely provides access to the internet. The ISP cannot be sued for copyright infringement just because infringing bits passed through its servers.

Also, a message board that posts whatever a user writes would obtain DMCA safe harbor. They just provide a service and don't screen out for content. An example is Craigslist.

This person's site DOES NOT enjoy DMCA safe harbor. He scraped the other site and populated his own. He is not even acting as a service provider as the DMCA statute prescribes.


My "not a lawyer" answer to this is:

It depends on the terms of service of the site.

#1 The TOS of the site may not let you use a robot on their site at all

#2 The TOS will define who owns the user generated content (UGC), either the user or the site

#3 Depending on who owns the UGC, you may or may not be able to scrape it. If it's the site, scraping is against their TOS; if it's the user, you would need permission from the user.

#4 As other people have said fair use might come into play. If the site owns the material using a single user contribution might be fair use within the context of the whole site. If the users own the content you are likely to be using all their content, thereby not able to use fair use.

Again, all of these are my observations. Hopefully it will give you something to think about. If you are starting a business based on this, you do need to consult a lawyer. Also, starting a business based on page scraping is a pretty risky thing to do. If the scraped site turns you off you could be pretty screwed.


I'm not a lawyer, but had to do a bunch of legal research regarding this topic for some of my sites.

I think what most of the responses so far are missing is the importance of accrediting the content to the content owner (likely the site, not the contributing users), and providing a link to the source.

Check out this PDF, Sableman's authorized linking (http://www.law.berkeley.edu/journals/btlj/articles/vol16/sab...), and search on Google v. Perfect 10.

You haven't really given much to go on with respect to what you are scraping, and what you plan to do with it. But I think a bunch of common sense and ensuring that your site in no way harms the original source's site (defamation, etc), are the most important things to consider.

Hope this helps.


A few years back I wanted to set up a menu service - putting a bunch of restaurant menus online so that others could search. My lawyer advised that because the restaurant made these materials open in the public domain, we could basically do whatever we wanted with them. As if they were public property.

I'm not sure how this relates to other kind of content from a legal standpoint, but I've used it to ask myself 'did this person intend for this information to be public' as sort of an ethical guideline.


Woah, that is seriously not correct advice. Was your lawyer even a copyright lawyer?


Yeah, I'm not sure "chicken, $10" can be copyrighted. The image of a menu might be a problem, but the text? Not to mention his lawyer may have been giving practical advice (hopefully acknowledged as such): what restaurant would really argue with their menu being available?


Isn't there a defense for factual information? The fact that a particular restaurant sells chicken for $10 cannot be copyrighted. The look and feel of their menu (like, the squiggly decorations, the font, the layout) can be copyrighted.


Exactly. If all menus were just "BLT, $5", "Hamburger with tomatoes, lettuce, and onions, $6" then you could make the analogy.

However, compare "basic facts" with an example from the menu from my favourite restaurant (www.chachachar.com.au) (and an example of fair use!):

"Eye Fillet Age 28-36 mths old Lean with sweet, clean, toasty flavours Served with truffled mash potato, salad of grilled pear & pancetta with smoked tomato relish

Hereford sourced from fertile pastures of NSW raised on Cattle Care accredited properties that are managed under stringent quality & environmental systems and are finished on feedlots in Jindalee."


That's what I was thinking. A phone directory can't be copyrighted. Stealing the database is a crime though. I think that's what the original question was really about: How far can he go?



We consulted a copyright lawyer, but I never moved forward with this, so did not really dig too deep.

I've noticed there are plenty of menu services that take this same approach. Just do a Google search and you will see there are many guides putting menus online without a license from the restaurant. What am I missing?


They are not 'open in the public domain' nor can they be treated as 'public property'. Somebody wrote those menus from scratch and as such they own the copyright. If you wanted to take an excerpt from the menu and comment on it, then that is fair use. When reproducing it in full, they technically can request you remove it, or any of the other avenues for restitution copyright law affords. Most won't because it benefits them to have their menu available and widely disseminated - which is why they give them away in the first place. But don't confuse them giving the content away as it then being your property you can then do what you please with.


I actually think it may be more of a gray area than you think, read the description of Public domain here: http://www.publicdomainsherpa.com/ - then think about how restaurants use their content when distributing to-go menus, advertising, etc, etc.

So, I called my friend who does have a menu service. His lawyer told him that case law may support the content of menus being public domain, but NOT the trademarks (like logos).

More important, he says, is the business issue. That is, the restaurant (or as it applies to this thread, the content owner) must PROVE that damages were caused by you, in order to take legal action.


Where are you getting your information from? Since when do you have to prove damages to uphold copyright law? Have you been paying any attention at all to the RIAA?

Public domain is what it sounds like. Things in the public domain are available for use by anyone for any (legal) purpose. Only once copyright expires (or is voluntarily relinquished) does a work enter the public domain. Menus are copyrightable, and automatically granted copyright. Regardless of the "business use", the business owns them and their use.

Who are these lawyers and why are they making such a hash of copyright law? Why are you guys trusting them?


The RIAA is claiming harm for uncompensated use and distribution of content they sell. It's completely different.

Did you read the definition of public domain on the link I just sent you? Apparently not. Here you go:

* their copyrights have expired; or
* the copyright owner didn’t follow certain required formalities (so they didn’t get a valid copyright); or
* the works weren’t eligible for copyright in the first place; or
* their creators dedicated them to the public domain.

You still think that the content of restaurant menus is copyrighted without question? Go down the street and let me know where I can find the formal copyright for 'Ben's Kabob's' - or whatever local diner is in your neighborhood. Even something like Chili's could be argued to be public domain because of the way they are already allowing use of their content for marketing purposes.

Who are these lawyers? And who are the lawyers for the dozens of menu guides I can find on Google? Why do all these entrepreneurs 'believe' them that it is a safe venture? Well, it seems like they are telling a consistent story, there have been zero problems after several years, and overall, it's just a question of risk like any other decision you make in your business. It's not a perfect world, and sometimes you just have to take the advice of the best domain experts you can find and move forward to focus on building a profitable enterprise.


I did go to the website you linked, and it agreed with everything I'd posted so far.

RIAA does not have proof of loss or damages in a lot of cases. Neither does MPAA. http://blog.wired.com/27bstroke6/2008/06/mpaa-says-no-pr.htm....

Besides, that is irrelevant. What you meant was you need proof if you are suing for damages. You don't need proof of damages if you want someone to stop using your property.

Yes, restaurant menus are still copyrighted without question. Arguing with you about this is getting very tiresome; I suggest you go read the Wikipedia article on copyright as a starting point, or the other article I posted. Trust me, try reproducing the entire Chili's menu and see what happens. It'll be an interesting experiment. Their menu is not in the public domain.

I very much doubt most entrepreneurs believe that it is a safe venture. Most people won't fight you on reproducing things if it benefits them, which is why people do it. But in reality the copyright owners are perfectly within their rights to demand a licensing fee or the removal of the property.


News is one case, but how about the millions of ratings & reviews floating out there on the web? What about this site? http://www.boorah.com/restaurants/CA/palo-alto/the-counter/A...

They scrape, re-present abstracts from, and supposedly do calculations based on the entirety of user-submitted data collected by a number of sites.

Does that violate fair use in anyone's opinion?


I would suggest that you clearly mention the origin of anything you get from anywhere on the internet.

The truth is, it’s hard to find a site without a mention of copyright on its terms and conditions page.

Overall it will depend on the use you make of the copied content. Trouble starts when you use someone’s material in a line of business that competes with the owner's.


Google does it, as do vertical search companies like Indeed and SimplyHired, so it's certainly legal under some conditions. Taking only a snippet and linking back to the source is generally okay (as the original publisher benefits from traffic and SEO juice).
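To make the "snippet plus link back" idea concrete, here is a rough sketch in Python. The function name `make_snippet`, the 200-character limit, and the example.com URL are my own assumptions for illustration; nothing here reflects how Google, Indeed, or SimplyHired actually build their excerpts, and no particular length is a legal safe line.

```python
def make_snippet(text: str, source_url: str, max_chars: int = 200) -> str:
    """Return a short excerpt of text followed by a link back to the source."""
    # Collapse runs of whitespace so the excerpt fits on one line.
    text = " ".join(text.split())
    if len(text) > max_chars:
        # Cut at the last word boundary inside the limit and mark the truncation.
        text = text[:max_chars].rsplit(" ", 1)[0] + "..."
    return f"{text}\nSource: {source_url}"


if __name__ == "__main__":
    article = "A long user-submitted story about a local event. " * 20
    print(make_snippet(article, "http://example.com/stories/123"))
```

The point is only that the excerpt stays short and always carries the link to the original, which is what sends the traffic back to the publisher.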


I assume you're talking about Google re-serving cached content. I think Google's defense boils down to "Any site without a properly configured robots.txt file implicitly grants permission to be spidered, cached, and linked to.":

http://www.google.com/support/webmasters/bin/answer.py?answe...

I don't know if this "opt-out" strategy has been tested in court, but I wouldn't assume that wholesale copying of user content is analogous to Google's situation unless the users have some similarly accepted way of opting out of the copying.


Relevant recent discussion on slashdot:

http://yro.slashdot.org/article.pl?sid=08/07/08/1245204


I'm similarly curious about the legality of using screenshots.



