
> Most web scraping is illegal in the United States.

Got a source on that? Google scrapes all the time. This is how they index all the pages it discovers.

The only real case I recall is craigslist v. 3Taps, and there 3Taps just kept scraping craigslist through multiple proxies even after craigslist banned its IP addresses.

Then there are airplane ticket websites scraping each other and getting into hot water.

Having said that, it's not as clear-cut as you'd like to put it. The CFAA ruling only came about because craigslist felt directly threatened by PadMapper, which relied on 3Taps.




They didn't claim CFAA on us (PadMapper), and there was definitely no ruling on it (all parties settled). Just for the record.


Are you stating that there was no CFAA claim, or that PadMapper wasn't the involved party, because it was actually 3Taps? The case against 3Taps definitely included a CFAA claim and the judge refused to dismiss it.


You're right that there was a CFAA claim, but there wasn't one made against us.


Wow. So the guys doing all the heavy lifting (3Taps) took all the heat in the end. It looks like 3Taps is out of business but PadMapper is still up and running... getting data from crowdsourcing? It's really odd that if you make this efficient by automating it, then it's hacking.

This really is a shitty shitty business model. All that work 3taps did for you guys and they take all the heat? I don't know why 3taps didn't just comply, was PadMapper 100% of their business?


Please don't edit your posts to substantially modify their meaning after someone has replied to you. You make ericd's response look weird now. Reply to the post again if you want to make a different point.


I couldn't reply because I was submitting too fast, so instead of replying I added to my original point, which was that 3Taps took the heat for PadMapper. The fact that PadMapper didn't get slapped with a CFAA claim means 3Taps took the major heat, and since you keep describing the CFAA as the biggest blunt instrument here, I don't see why it makes his response look weird. He even wrote that PadMapper was not the subject of a CFAA claim; 3Taps was. It makes sense that he can't talk in detail about the case for legal reasons.


I hate that HN does that to anyone. It should be reserved only for spam bots and obvious bad faith participants, not someone with an unpopular opinion trying to have a conversation. I've encountered it before too. Sorry that it happened to you. You may want to lodge a complaint with dang so that he understands it's not a good mechanism.

I definitely think that at the outset, it looks weird that 3Taps ended up taking PadMapper's heat, but I think that 3Taps wanted to become a generalized thing-as-a-service vendor. It's possible that PadMapper wasn't 3Taps's only customer for the CL feeds. As PadMapper wasn't contacting CL's computers without authorization, it makes sense that CL had to change the target to 3Taps. At that point, PadMapper would've seen that scraping CL meant a near-impossible legal challenge for a startup and been wise enough not to implement their own solution.

This is all just speculation, but I doubt that 3Taps stuck its neck out for the sole benefit of PadMapper.


I think there is a delay before the "reply" button appears, for posts past a certain nesting level.

I like this feature because it impedes the rapid nesting of conversations, and also allows the author time to edit his reply before anyone can address it.


No worries, just trying to add some nuance. Probably can't share much there.


>Got a source on that?

The source is the CFAA, which makes it a crime and/or a tort to commit any "unauthorized" access to a computer system. Because authorization is not defined in the statute, it's a matter of interpretation whether or not one's use is unauthorized. Historically, judges have strongly disfavored scrapers.

Most boilerplate Terms of Use contain language that forbids all "spiders, scrapers, bots, and all other automated means of access", or something along those lines. Most companies assert that accessing any page beyond the front page of their site constitutes a binding agreement to their ToU, and thus that any automated access is "unauthorized". Scrapinghub appears to be of the opinion that browsewrap agreements are unenforceable, and while some judges have agreed with that, some haven't.

Beyond the argument that scraping is a breach of contract (you violated their Terms of Use, and since you agreed to that contract, you understood that automated access was unauthorized), there's the potential criminal element. It was deployed against weev for exposing a minor data leak in AT&T's system, and against Aaron Swartz for exceeding MIT's authorized access to JSTOR and downloading publicly-funded academic data (including data which was out of copyright). You basically just have to really hope that no one inside the company you've "wronged" is good friends with a prosecutor.

Because there is a lot of grey area around what may or may not constitute "unauthorized" access to a computer system, if a company does bring a tort claim against you for accessing their system without authorization, you might actually win -- if you can afford the time and money to fight them for the minimum 3-5 years it'll take your case to resolve. That is easily hundreds of thousands of dollars in legal fees.

3taps eventually had no choice but to give up because they couldn't take the legal costs anymore, and Power Ventures tried to stick it out and ended up not only being held liable for $3 million in damages to Facebook's systems when no actual damage had occurred at all, but the veil was pierced and the founder held personally liable. It's obvious from the court documents that he was struggling to afford counsel, and companies must be represented by an attorney, so he didn't even have an option to try to represent himself.

>Google scrapes all the time. This is how they index all the pages it discovers.

Yes, Google's operations are, strictly speaking, illegal on various fronts. They depend heavily on automated access, which many of the sites they index explicitly forbid, so Google is committing "unauthorized access" to these computer systems. They also store complete copies of the site and of the individual images displayed on it, virtually all of which are protected by copyright, and all of which constitutes a flagrant violation of copyright law.

If someone did bring a CFAA claim against Google for this (which no one would, because Google is one of the wealthiest companies in the world, and it'd therefore cost tens of millions to sue them), Google would likely argue that robots.txt is the only authorization it is obligated to observe, which may or may not be an effective argument. Google also makes no guarantees about the extent to which it obeys robots.txt; it's a way to signal your desires to Google, which it may or may not honor.
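
For anyone unfamiliar with how a crawler consults robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `examplebot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration. Note that this only shows the mechanical check; whether honoring robots.txt amounts to "authorization" under the CFAA is exactly the open question described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this
# from the site's /robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/private/page",
                "https://example.com/listings"):
        print(url, is_allowed(ROBOTS_TXT, "examplebot", url))
```

The check is purely voluntary on the crawler's side: nothing in the protocol stops a client from skipping it, which is why it works as a signal of the site owner's wishes and not as an access control.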

tl;dr The very short answer to all of this is that traditionally, the legal system has been extremely suspicious of scrapers and has treated them very badly, applying concepts intended for the physical world like trespass to chattels to server access. This has been improving somewhat in recent years, but is still a very financially and legally precarious situation in which to find oneself. The people who get away with it get away with it because no one sued them before they were too big to sue.


The tricky thing is that if a scraping tool or service provider complies with website owners' demands to stop scraping, there is very little left to claim damages on. Even if the customer used Scrapinghub to log in to websites and scrape all the emails, all Scrapinghub would need to do is hand over their customer on a silver platter. This is what the DMCA is for. Can you imagine if you manufactured a bicycle and somebody used it to commit a crime? Plausible deniability. Scrapinghub can't monitor everyone's usage all the time to make sure they are following each website's ToS (which are not legally binding).


The DMCA protects service providers from copyright claims for user-generated content as long as they comply with takedown requests, etc. Scrapinghub may have a defense to copyright claims there (though I seriously doubt it due to the nature of their relationship with the customer; they're not a DMCA "safe harbor" and the data they're using isn't user-generated content), but not to CFAA claims.

It's illegal to break the CFAA whether the plaintiff specifically tells you that they think you're doing it or not. If they send a C&D, yes, you'd be wise to comply, but that's not going to absolve you from claims that you harmed their company by violating the CFAA before they sent it (which do happen and are usually claiming a pretty ridiculously silly amount of damages for something as innocent as downloading a web page from their server). You'd have to argue in court that your access was authorized and they'd have to argue that your access wasn't authorized. The judge and/or jury would then evaluate.

3Taps was actually quite similar to Scrapinghub. I don't think they have as much of a defense as you'd like. And Terms of Use are actually usually considered legally binding; to the extent that they're not, it's usually because of something minor like not putting the notice that you agree to the ToU by using the site in plain view.


I think you are overestimating the reach of CFAA. There are multiple vendors of web scraping tools and services, not just Scrapinghub. All of them have been operating longer than 3Taps, and some still scrape craigslist and get away with it without issues, for the same reason you could hire a guy on Freelancer to scrape craigslist for you. 3Taps went above and beyond for their best client, PadMapper, and got burned.


>I think you are overestimating the reach of CFAA.

I don't think so. The CFAA states:

>Whoever intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains information from any protected computer shall be punished as provided in subsection (c) of this section. (a)(2)(C)

It defines a "protected computer" as:

>...the term "protected computer" means a computer which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States; (e)(2)(B)

As the Supreme Court has ruled that virtually anything in the United States is subject to the Commerce Clause, this comprises practically all computers, especially after you consider that usage of a computer network almost certainly takes your traffic out of state. Many states have corollary laws to the CFAA with substantially similar language, so if you can miraculously convince a judge that the computers involved are not part of interstate commerce and that the feds therefore have no jurisdiction, there's a good chance you'll have to contend against a similarly-worded state statute.

I don't see any limitations or exceptions here. If you are accessing a computer in an "unauthorized" manner and obtain information whilst doing so, you have violated the CFAA.

The reason scraping can happen is a combination of lack of technical awareness (both from lawyers about computers and from programmers about law) and the cost of pursuing a lawsuit. Even if you break the law, someone has to take issue with your law-breaking before anything happens; they have to file either a lawsuit or an indictment to get the ball rolling. That some people are able to get away with violating the CFAA without someone registering a formal complaint on the matter has nothing to do with whether or not one has violated the statute.

The only way that scrapers don't violate the CFAA is a liberal interpretation of the term "unauthorized", wherein a judge states that if a computer is advertising and allowing public access, then all members of the public are inherently authorized to access it. I know that several scrapers have taken their cases through the courts hoping that such an interpretation would be given.



