Web Scraping in 2016

minimaxir · on Aug 23, 2016

Keep in mind that companies have sued for scraping not through the API, for example LinkedIn, which explicitly prevents scraping via the ToS: http://www.informationweek.com/software/social/linkedin-sues...

OKCupid did a DMCA takedown for researchers releasing scraped data: https://www.engadget.com/2016/05/17/publicly-released-okcupi...

Since both of these incidents, I now only scrape if it's a) through the API following rate limits or b) if there is no API, and the data has the explicit purpose of being shared publically (e.g blogs), I follow robots.txt. Of course, most companies have a do-not-scrape clause in their ToS anyways, to my personal frustration.

(Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

franciskim · on Aug 24, 2016

On the ethics side, I don't scrape large amounts of data - eg. giving clients lead gen (x leads for y dollars) - in fact, I have never done a scraping job and don't intend to do those jobs for profit.

For me it's purely for personal use and my little side projects. I don't even like the word scraping because it comes loaded with so many negative connotations (which sparked this whole comment thread) - and for a good reason - it's reflective of how the the demand in the market. People want cheap leads to spam, and that's bad use of technology.

Generally I tend to focus more on words and phases like 'automation' and 'scripting a bot'. I'm just automating my life, I'm writing a bot to replace what I would have to do on a daily basis - like looking on Facebook for some gifs and videos then manually posting them to my site. Would I spend an hour each and every day doing this? No, I'm much more lazier than that.

Who is anyone to tell me what I can and can't automate in my life?

ccvannorman · on Aug 24, 2016

This is exactly my response to "you can't legally scrape my site because of TOS." I don't think anyone has a legal right to tell me HOW I use their service. Making "browsing my website using a script you wrote yourself" illegal is akin to "You cannot use the tab key to tab between fields on my website, you must only use the touchpad to move the cursor over each field individually."

It's baloney.

cookiecaper · on Aug 24, 2016

Under the CFAA, they do have the right to determine what constitutes authorized access. If they say you're unauthorized for using the wrong buttons on your mouse, then you're unauthorized. It's treated very similarly to trespass on private land.

You can try telling the judge it's baloney, but if he's going by current precedent, he probably won't agree with you.

ccvannorman · on Aug 24, 2016

It is not the same flavor of violation as trespassing on land, more like riding a bicycle on a sidewalk. Fortunately, I'm not in the business of scraping sites, but I still find this legal precedent abhorable, and I hope it gets struck down in court when push comes to shove. I would certainly vote that way if given the chance.

Dayshine · on Aug 24, 2016

It's like riding a bike in a skate-park with a small sign saying "No bikes".

TeMPOraL · on Aug 24, 2016

It's like riding a red mountain bike on a bike path with a small sign saying "only black road bikes allowed".

lerpa · on Aug 24, 2016

> Who is anyone to tell me what I can and can't automate in my life?

You are exactly right. But although a site can deny you access for any arbitrary reason (it's their website, after all) obviously government think they are the ones to enforce this crap.

What if the ToS say you can only access a site while jumping hoops? Only read the ToS after a while and wasn't hooping? Well too bad, now you are being sued for reading the main page _and_ the ToS page without jumping around.

This comment Terms of Service: If you read any of this text you owe lerpa $1.000.000 to be paid up until 09/01/2016.

tangue · on Aug 23, 2016

It would be ok if it wasn't "You can't scrap my site. Unless of course you're Google" this double standard drives me mad.

sjwright · on Aug 24, 2016

As the owner of a large website, I don't care what you think. I block by default and whitelist when I decide it's in my interest.

If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.

Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.

bigtunacan · on Aug 24, 2016

As the user of large websites I don't care. I'm not going to read the TOS and I will continue to scrape what I like since it makes my life more convenient. Like OP when blocked I'll just drive my scraping through a web browser which is the same as I've done for years on various sites that never provided APIs.

louprado · on Aug 24, 2016

"As the user of large websites I don't care". Are you sure ? Do you want your OK Cupid or LinkedIn profile to be crossposted on another website without your knowledge.

lerpa · on Aug 24, 2016

If you don't want your data public, then don't make it public in the first place. That's a good rule of thumb.

vidarh · on Aug 24, 2016

Putting it behind a signup page with terms that don't allow sharing is not "making it public".

And while in the US that may "just" be treated as unauthorized access, in the EU, if you make the data public it's also a violation of the Data Protection Directive, putting you at risk of prosecution in every EU country from which you have included data.

You may be right from a risk minimisation perspective. But for a lot of data the risk in the case of exposure is low enough that it is a totally valid risk management strategy to assume that legal protections will be a sufficient deterrent to prevent enough of the most blatant abuses.

kuschku · on Aug 24, 2016

Eh, not really. The Data Protection Directive doesn’t even apply here – if the first party (OKCupid) made it available to a third party (the scraper), then the first party can be held in violation, but not the third party.

Silhouette · on Aug 24, 2016

If you have control of personally identifiable data, it's likely that at least some of the EU data protection rules will apply to you regardless of how you got it.

tripzilch · on Aug 24, 2016

Yes but as you say, they apply regardless. More specifically, they apply to data that you have (and are storing), not the act of obtaining it.

As a private individual it's not hard to comply either, for private use. If you publish it, it becomes a different story, because it's PII. And, as soon as it's in possession of a company, they need to comply with more rules about securely storing it, etc. (this isn't enforced very well, though). Private individuals can't be held to that because there's (in theory) no legal way to check it.

Analemma_ · on Aug 23, 2016

Why is it a double standard? Google scraping usually benefits the site with increased traffic and revenue, in a way most other scraping does not. Saying "you can scrape me if it benefits me" isn't totally in keeping with the principles of the open web, but it's not hypocritical.

nmstoker · on Aug 23, 2016

With a risk of stating the obvious, this is a double standard simply because there are two standards - one for Google and one for others. I can't speak for the poster you were replying to, but whilst I see it as logical self-interested behaviour by site owners, it still feels unfair.

bendbro · on Aug 23, 2016

There isn't: the function for this standard includes expected benefit as an input. Every standard has inputs, so that certainly isn't the quality for making something a double standard. The only remaining quality is how unfair it feels, so it would probably be better to just address that, since it is obviously the only thing you disagree about.

lugg · on Aug 24, 2016

With that logic decreased wages for women are not a double standard due to the potential for maternity leave affecting their output at work.

This is a double standard plain and simple, and a very dangerous one at that.

sp527 · on Aug 24, 2016

Your example is a case of discrimination, but the economic rationale is unquestionable. There is a tremendous upfront cost for new employees, who are not valuable contributors for some lengthy ramp up period and furthermore accrue experience over the course of employment. So the lifetime value curve for any given employee is typically skewed left.

bendbro · on Aug 24, 2016

I think I'm just being difficult.

My point was that when you call something a double standard, you're arguing two things of equal value have been judged differently under the same standard. But by acknowledging they've been judged differently, you're acknowledging that there is a judgement, a standard, that applies the same to both, and produces the results you object to. What you really object to is the fairness of the qualities checked by the standard.

Since the outcome of calling things that, vs calling them a double standard is the same, I think most people already know and have no trouble with this. My protests were worthless.

It could gain value if there were certain whitelisted judgable aspects (like expected value), and judgements that aren't based on things from the whitelist are considered outside the scope of a standard. Then, calling the standard unfair and calling it a double standard would have a different meaning (if only in some contrived way, since any aspect is just an argument away from the whitelist)

lerpa · on Aug 24, 2016

It's their site they can block whatever they want. The problem is the stupid far reaching conclusion that this is trespassing.

Even normal trespassing laws are way too overreaching (see how it is handled in the UK for a saner example) but now you have the amazing possibility of remote trespassing.

The fun part is that it's just a matter of someone hiding something that says you cannot access the site in a place that you have to access the site in order to read -- the ToS. Suing people over this is idiotic.

The real problem is the involvement of Govt, and this kind of absurdity regarding ToS, EULAs and so on, is something that has been going on for decades. If you have the money you can make Govt your personal watch dogs.

mgalka · on Aug 23, 2016

Whether or not it technically qualifies as a "double standard," in practice I don't see anything inherently unfair about it.

If a stranger enters my house without my permission, that's trespassing. But there's nothing unfair about letting in someone who I invite over.

nercht12 · on Aug 24, 2016

That's a terrible analogy. Your home is private, websites are not. The fact is that websites are posted online for all to see, so it's more like saying certain people at a park may take pictures while others are not allowed. That's unfair. If everyone could take pictures, it would be fair. Yes, someone with an old bright bulb camera might be annoying people, but nobody said "fair" meant all players would be nice or that having a "fair" policy would somehow be more beneficial to the website owner. It's not, that's why site owners are selective. So they have a double standard, but it's for their benefit, not that of the site visitors (be they human or bot).

sjwright · on Aug 24, 2016

How about the analogy of an art gallery disallowing photography? Is the gallery being hypocritical when they allow the local paper to take photos for publicity, or when they permit an archivist that has a known reputation to take photos for archival purposes?

Grishnakh · on Aug 25, 2016

You can still deal with the old bright bulb cameras: you can have rules which apply to everyone. So you can make a rule at the park that pictures are allowed, but only without flash, or that only digital cameras are allowed, or only digital cameras with the fake-shutter noises turned off, etc. As long as the rule applies to everyone equally, it's fair, even if you think the rule is silly.

For websites, it's not fair to have different rules for Google than others. What would be fair is some kind of rule about how often visitors can visit, how much they're allowed to download, etc.

Personally, though, I think all this is total BS. Sites are open to the public, but they also serve the whims of their owners. If the site wants to prevent access to people from a certain IP range, that should be their right. If they don't want any scrapers, that should be their right too, or if they want to allow Google and not anyone else, that should also be their right. What isn't right is that they can use the government to enforce these arbitrary rules. If they want to block my scraper, that's fine, if they can do it on their end technologically. If they want to block my IP, they can do that too. But suing me or having the cops come to my door because they're too incompetent or lazy to do these things technologically is unacceptable. The role of government is not to enforce arbitrary policies made up by business owners.

gakada · on Aug 24, 2016

Using the law to block crawlers is more like saying:

1. Google can come in

2. Other Americans can't come in

3. Chinese people can come in (or anywhere else where US laws don't apply)

It might not be unfair, but it is certainly pointless and arbitrary.

paisawalla · on Aug 24, 2016

To be fair, many companies which take anti-scraping seriously will also take inputs like geographic origin of a request into consideration when applying request throttling and filtering.

tangue · on Aug 24, 2016

Google is basically algorithms built on top of a scraping service. It's unfair to competitors (and potential disruptors) to restrict access to data that Google can fetch without limits.

collinmanderson · on Aug 24, 2016

Maybe we (scrappers) just need to market ourselves as search engines. _Indexing_ is what we're doing. :)

lerpa · on Aug 24, 2016

Exactly, market scrappers as search engines.

And all smart websites should include a ToS that says you are not allowed to access their data, so they can sue for trespassing anyone that they don't like selectively.

The far reaching of government into this, and also the pirating stuff (which I do not condone but think that arresting people for that is waay too much) is what makes me want for the system to collapse under it's own weight. Like some website suing members of congress for visiting it while violating the ToS in this case.

I also secretly wanted Oracle to win vs Google so that cloning an API was piracy and that would extend to being a crime to purchase pirated goods which would make all clean room reverse engineering a criminal activity. That would lead to anyone that uses a PC without an authentic IBM BIOS (look up Phoenix BIOS) to be arrested, in theory, so even the US president would have to fall into that. It would have been a glorious shitstorm if Oracle won and IBM took that precedent to it's logical implications, the computer world would have failed, and the law would either be made even more arbitrary or be fixed, but at least it would be shown how idiotic the state of affairs was.

Grishnakh · on Aug 25, 2016

Your idea about Oracle winning and society coming crashing to a halt is ridiculous and wouldn't have happened. Your flaw is believing that the law and the government will work with logical precision, so that a flaw in the law will, like an infinite recursive loop in programming code, cause complete disaster. It doesn't work that way. There's plenty of cases where the law is clearly broken (see civil forfeiture vs. the 4th Amendment to the US Constitution), yet nothing is done. That's because the government is run by humans, and they'll enforce things the way they want. Double standards happen all the time with law, and it takes big, expensive court cases to sort them out, and of course that only happens when some moneyed interest wants to fix it (which is why civil forfeiture is still a big thing--they're not going after extremely wealthy people or corporations with it). While IBM is certainly large enough to bring a big case like you suggested, the US government is far bigger and can simply invent a legal way of ignoring them, just as was done when the SCOTUS decided to rule in favor of using Eminent Domain to seize private property to hand over to commercial interests.

profeta · on Aug 24, 2016

because i may also come to the point where i am a direct competitor to google, but i will never get there because i can't scrap any site like they can.

your next argument may very well be a very racist one with the very same excuse you used above.

sjwright · on Aug 24, 2016

And if you have some way to identify yourself as a potential competitor to google and not some jackass trying to scrape email addresses or spam comments forms, I'm all ears.

wumpus · on Aug 24, 2016

A majority of the websites that blekko, a google competitor, contacted to ask for robots.txt access ignored us.

sjwright · on Aug 24, 2016

I agree, it's a difficult conundrum. It sucks.

wumpus · on Aug 24, 2016

There are worse barriers to entry for a search engine! DMCA take-downs... Right to be forgotten... click history...

_c_ · on Aug 24, 2016

Worse is that Google tries to stop scraping. It's like they don't want anyone to see past the first page of results.

They could scrape your website and then they prevent you form scraping your own data back.

The whole process is silly; it reflects the duct tape and chicken wire nature of the www.

No one should have to "scrape" or "crawl".

Data should be put into a open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.

This to bridge the gap until we reach a more content addressable system (cf. location based).

Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.

"Crawling" should not be necessary.

No one should have to store HTML tags and other window dressing for data.

Dream on.

NamTaf · on Aug 24, 2016

That's the antithesis of the world wide web because you've just centralised data storage, which makes someone 'own' the www.

_c_ · on Aug 24, 2016

I do not understand your argument.

To give an example, there is a lot of free open source software mirrored all over the internet, mostly on ftp servers, but also on http, rsync, etc.

If you use Linux or BSD you probably are using some of this software. If you use the www, then you are probably accessing computers that use this software. If you drive a new Mercedes you are probably using some of this software. There are a lot of copies of this code in a lot of places.

Is that centralized? Does anyone hosting a mirror ("repository") "own" the software? Is it the same person or entity hosting every mirror?

Compare Google's copies of everyone else's data, also replicated in a lot of places around the world. Who "owns" this data?

codeddesign · on Aug 23, 2016

Double standard? The difference is that Google Bot is built on being unobtrusive. I can easily built a scraper that will quickly ddos a site. Linkedin for example...if they allow 10,000 people to send 100 scraping requests per second everyday then that is stolen bandwidth that Linkedin has to pay for and the scrapers get free data. The difference is that Google has standards in which site's unusually benefit from, not to mention that they allow for you to disallow their bot. It just doesn't work the same way with some random developer building a scraper.

encoderer · on Aug 24, 2016

I agree that Googlebot is well behaved. When it detects your site is slowing down, it will back itself off. Unfortunately, this is often to your detriment.

In my experience, on a large site, Google will often slurp as much as you let it, upwards of hundreds of pages per second.

Guest98123 · on Aug 24, 2016

Google usually does 300 pages a minute on my site. In total, bots were loading about 1,000 pages a minute.

encoderer · on Aug 24, 2016

That is a lot! It's also still an order of magnitude less than big content sites. Not taking anything away from what must be a successful website to get a consistent 300 pages/minute crawl rate, but only to illustrate magnitude.

Guest98123 · on Aug 24, 2016

I was curious, so I just checked the stats through webmaster tools. For the last 90 days, the low is 450,000 daily crawled pages, average is 650,000, and yesterday was the high of 1,130,000 (780 per minute). Ouch.

encoderer · on Aug 24, 2016

Have you seen correlation between rankings and crawl rate?

Guest98123 · on Aug 24, 2016

This particular site is top 5,000 Alexa. The content changes every minute, and Google is fast at picking up those changes. The last cache of the homepage was 7 minutes ago from Google.

There's definitely a correlation between my sites' Google rankings, their organic traffic, and their crawl rate. The other sites I run are Alexa top 30,000 and top 100,000. They all feature dynamically changing content, but Google is definitely using a higher crawl rate on my higher ranking sites. This isn't a surprise though, Google has limited resources like everyone, and they'll focus those resources in a way that provides the most benefit.

Edit: If you're talking about the correlation between daily ranking and daily crawl rate for an individual site, then no, I'm not aware of any patterns. For example, the graph is flat for organic traffic and total indexed pages, but the crawl rate jumps up and down as mentioned, and it doesn't appear to relate on a daily basis.

encoderer · on Aug 24, 2016

I've seen rankings drops following drops in crawl rate.

desireco42 · on Aug 24, 2016

Google and others, with legitimate reasons obey robots.txt

hobs · on Aug 23, 2016

This post is kind of crazy, aggrandizing bad behavior and misuse of other's resources against their will.

Scraping against the TOS is super bad netizen stuff, and I dont think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.

madamelic · on Aug 23, 2016

>Scraping against the TOS is super bad netizen stuff, and I dont think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.

Not really.

Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.

If you are slamming the site with requests because of your scraping, yeah you need to knock it off. If you throttle your scraper in proportion to the size of their site, you aren't really harming them.

In regards to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.

tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.

muglug · on Aug 23, 2016

> If you throttle your scraper in proportion to the size of their site, you aren't really harming them.

And do you understand their site infrastructure to know whether you're doing harm? It's perfectly possible that your script somehow bypasses safeguards they had in place to deal with heavy usage, and now their database is locking unnecessarily.

cookiecaper · on Aug 23, 2016

Eh, this is pretty weak. Scrapers are no different from other browsing devices. The web speaks HTTP. There's no reason that using another HTTP browser would cause any disparate impact just by virtue of not being a conventional desktop browser -- you've thrown out a pretty absurd hypothetical. In fact, scrapers usually cause less impact because they usually don't download images or execute JavaScript.

I did an analysis and a session browsed with my specialized browser would always consume less than 100K of bandwidth (and often far less), whereas a session browsed with a conventional desktop browser would consume at least 1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition, on the desktop, a JavaScript heartbeat was sent back every few seconds, so all of that data was conserved too.

Because we were a specialized browser used by people looking for a very specific piece of data, we could employ caching mechanisms that meant that each person could get their request fulfilled without having to hit the data source's servers. We also had a regular pacing algorithm that meant our users were contacting the site way less than they would've been if they were using a conventional desktop browser.

Our service saved the data source a large amount of resource cost. When we were shut down, their site struggled for about two weeks to return to stability. I think they had anticipated the opposite effect.

Our service also saved our users a large amount of time. We were accessing publicly-available factual data that was not copyrightable (but only available from this one source's site). There's no reason that the user should be able to choose between Firefox and Chrome but not a task-specialized browser.

It is true that some people will (usually accidentally) cause a DDoS with scrapers because the target site is not properly configured, but the same thing could be done with desktop browsers. It doesn't mean that scrapers should be disadvantaged.

hchasestevens · on Aug 23, 2016

A small counterpoint to this -- in the airline industry, it's relatively commonplace for seat reservations to be made for a user _before_ payment has occurred. In this case, if you're mirroring normal browser activity, you can (temporarily) reduce availability on a flight, potentially even bumping up the price for other, legitimate users, and almost certainly causing the airline to incur costs beyond normal bandwidth and server costs. I'm sure there are many other domains for which this is also the case, however rare.

gnud · on Aug 23, 2016

If they don't do the seat reservation behind a POST, or at least blacklist the reservation page in robots.txt, I have no sympathy.

Steeeve · on Aug 24, 2016

I've had this happen regularly enough that I developed the habit of finding a fare in the morning and then returning at 11pm ready to buy.

Chris2048 · on Aug 24, 2016

Cinemas and some online shops do the same. I've always wondered if it's possible to block any tickets sales for entire flights/screenings/products this way.

And if airline tickets are based on supply v demand, it might even be possible to drive down ticket prices by suddenly dropping a load of blocks near to the flight date.

angry_octet · on Aug 24, 2016

If you have ever tried to buy hot tickets online and not been able to get any, this is due to bots. Bots are the scalpers friend.

This can easily be prevented by requiring ID matching the ticket on entry, but the ticket sellers often don't seem to care.

muglug · on Aug 23, 2016

> you've thrown out a pretty absurd hypothetical

Not even remotely absurd. Where is the data your scraper consuming coming from? It's almost always served from some sort of data repository (SQL or otherwise). That data costs far more per MB to serve up quickly than JS/CSS/images.

Suppose, for example, you host a blogging platform that has one very popular user. Most accounts on your site don't get a ton of visitors, and that one very popular user's post are all stored in cache.

Then along comes a scraper. He thinks, "Hey, this site is serving up a million page impressions a day. It can definitely handle me scraping the site".

But when he runs the scraper, he fills up the cache with a ton of data that it doesn't need, causing cache evictions and general performance degradation for everyone else.

cookiecaper · on Aug 23, 2016

There are already 6-8 major scrapers that do this constantly, across the whole internet, called search engines. You can't handle that?

What if you get a normal user who says "Hey, I wanna see some of the lesser known authors on this platform" and opens up a hundred tabs with rarely-read blogs? What if you get 10 users who decide to do that on the same day? Is it reasonable to sue them? Should there be a legal protection to punish them for making your site slow?

Don't blame the user for your scaling issues. If the optimized browser ("scraper") isn't hammering your site at a massively unnatural interval, it's clean. And if it is, you should have server-side controls that prevent one client from asking for too much data.

These are just normal problems that are part of being on the web. It's not fair to pin it on non-malicious users, even if they're not using a conventional desktop browser.

muglug · on Aug 23, 2016

Search engines respect robots.txt – not sure many scrapers do.

cookiecaper · on Aug 23, 2016

First, search engines are scrapers. No need to make a distinction.

Second, search engines don't always respect robots.txt. They sometimes do. Even Google itself says it may still contact a page that has disallowed it. [0]

Third, robots.txt is just a convention. There's no reason to assume it has any binding authority. Users should be able to access public HTTP resources with any non-disruptive HTTP client, regardless of the end server's opinion.

[0] "You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file." / http://archive.is/A5zh8

nostrademons · on Aug 23, 2016

In the Google quote you link to, Google is not contacting your page. Rather, Google will index pages that are only linked to, which it has never crawled, and will serve up those pages if the link text matches your query. That's how you get those search results where the snippet is "A description of this page has been blocked by robots.txt" or similar.

There's a somewhat related issue where to ensure your site never exists in Google, you actually need to allow it to be crawled, because the standard for that is a "<meta name=noindex ...>" tag, and in order to see the meta noindex, the search engine has to fetch the page.

hobs · on Aug 23, 2016

And the original point of my comment was that doing this is extremely rude and not appropriate, not that it couldn't be done or that others weren't doing it.

Feel free to send any request to any server you want, it is certainly up to them to decide whether or not to serve it, but that doesnt absolve you of guilt from scraping someone's site when they explicitly ask you not to.

cookiecaper · on Aug 23, 2016

Please don't conflate "extremely rude", "not appropriate", and "guilt". Two of these are subjective opinions about what constitutes good citizenship. The last one is a legal determination that has the potential to deprive an individual of both his money and liberty. We're discussing whether these behaviors should be legal, not whether they are necessarily polite.

hobs · on Aug 24, 2016

I never did.

You are posting in a comment thread underneath my reply about rudeness and impoliteness, ironically being somewhat rude telling me off about what not to conflate when it was never what I said.

tedunangst · on Aug 23, 2016

Google will put forbidden pages in its index. It doesn't scrape them. (The URL to the page exists even without visiting the page.)

Animats · on Aug 23, 2016

We do, and we also use our own user-agent string: "SiteTruth.com site rating system". A growing number of sites reject connections based on USER-AGENT string. Try "redfin.com", for example. (We list those as "blocked"). Some sites won't let us read the "robots.txt" file. In some cases, the site's USER-AGENT test forbids things the "robots.txt" allows.

Another issue is finding the site's preferred home page. We look at "example.com" and "www.example.com", both with HTTP and HTTPS, trying to find the entry point. This just looks for redirects; it doesn't even read the content. Some sites have redirects from one of those four options to another one. In some cases, the less favored entry point has a "disallow all" robots.txt file. In some cases, the robots.txt file itself is redirected. This is like having doors with various combinations of "Keep Out" and "Please use other door" signs. In that phase, we ignore "robots.txt" but don't read any content beyond the HTTP header.

Some sites treat the four reads to find the home page as a denial of service attack and refuse connections for about a minute.

Then there's Wix. Wix sometimes serves a completely different page if it thinks you're a bot.

DashRattlesnake · on Aug 23, 2016

> I did an analysis and a session browsed with my specialized browser would always consume less than 100K of bandwidth (and often far less), whereas a session browsed with a conventional desktop browser would consume at least 1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition, on the desktop, a JavaScript heartbeat was sent back every few seconds, so all of that data was saved too.

Bandwidth is certainly part of it, but there's also also database and app-server load (which may be the actual bottleneck) that a scraper isn't necessarily bypassing.

cookiecaper · on Aug 23, 2016

Yeah, I just have a hard time buying that a scraper that does less than a conventional desktop browser is going to accidentally stumble across something that causes the server-side to flip out. I'm not really sure in what case your hypothetical is plausible.

Scrapers are usually used to get publicly-available data more efficiently. What you're describing would basically require the scraper to hammer an invisible endpoint somewhere, but there's no reason the scraper would do that -- it just wants to get the data displayed by the site in a more efficient manner. I suppose the browser could enforce a cooldown on an expensive callback via JavaScript, which a scraper would circumvent, but IMO that's not a fair reason to say scrapers are disallowed; cooldowns should be enforced server-side. There's no way to ensure that a user is going to execute your script. That's just part of the deal.

Everything about scrapers means less server load; no images, no wandering around the site trying to find the right place, no heavy JavaScript callbacks that invoke server-side application load, etc. Scrapers are just highly-optimized browsing devices targeting specific pieces of data; it's logical that they would be cheaper to serve than a desktop user who's concerned about aesthetics and the like.

In our specific case, those JavaScripts we didn't download included instructions to make over 100 AJAX requests on every page load. No wonder users were looking for something more efficient.

So I agree that a scraper isn't necessarily bypassing some load-heavy operations, but I find it highly implausible that a non-malicious scraper would be invoking operations that cause extra load (beyond just hitting the site too often). Frankly, I'd be surprised if there was a functional scraper that regularly invoked more resource cost per-session than a typical desktop browsing session to get equivalent data.

DashRattlesnake · on Aug 23, 2016

> What you're describing would basically require the scraper to hammer an invisible endpoint somewhere

That wasn't my point. My point was: a lot of a website's costs are hidden from a web scraper (e.g. database load), so a scraper can't claim, based on the variables they can observe (bandwidth), that they're costing the website less than normal traffic.

I was basically responding to statements like this:

> In fact, scrapers usually cause less impact because they usually don't download images or execute JavaScript.

There's really no way for a scraper to know that unless the website tells them. Their usage pattern is different than typical users and raw bandwidth (for stuff like static images) may not matter to the website.

cookiecaper · on Aug 23, 2016

It's true that there's no way to know that for sure, but it doesn't make sense that a scraper, by virtue of its being a scraper, is incurring additional load. A scraper is only making requests that a person with a desktop browser or any other appliance that speaks HTTP could make. What's the difference between a user clicking the same button on the page 50 times or holding down F5 and a scraper that pings a page once a minute?

Your argument is basically boiling down to "scrapers could hit one load-heavy endpoint too fast", but so could desktop browsers. So I don't see what it has to do with scraping.

DashRattlesnake · on Aug 23, 2016

> but it doesn't make sense that a scraper, by virtue of its being a scraper, is incurring additional load

It does, because scrapers don't have normal usage patterns. They're robots and behave like robots.

> What's the difference between a user clicking the same button on the page 50 times or holding down F5 and a scraper that pings a page once a minute?

Typical users aren't usually in the habit of mashing F5, especially not for robotically long periods of time. It's basically the difference between a theoretical activity and an actual activity.

Basically, scraping is not regular usage, and I don't think it's correct to pretend that they're equivalent (or more extremely, that scraping is less costly to the website).

cookiecaper · on Aug 23, 2016

Scrapers are usually coded to have as regular of a usage pattern as possible, so that the data they retrieve is as much like the data the end user would receive as possible.

For example, Googlebot does everything in its power to ensure that it sees pages the same way that end users sees them, executing JavaScript and performing OCR to try to read information conveyed in images. Google also has non-Googlebot scans to try to determine if a page is serving different content to Googlebot-labeled scans, and they penalize sites that they suspect of doing this.

While it is true that someone could write a scraper that obviously behaved robotically, it is also true that someone could use their desktop browser in a robotic way. Mashing F5 is so common that there are many ancient memes referring to and making jokes about that activity. There are extensions that end users use to record browser macros, behaviors they want their browser to repeat over and over again.

However, this conversation about whether scrapers behave robotically or not is moot because a web site shouldn't break down under load when someone uses it in a slightly-irregular way. The obvious, crappy scrapers are trivial to block. The ones that blend into the traffic are no harm, no foul. If you can't tell the difference between an optimized browser like a scraper and a general-purpose browser like Chrome, why shouldn't it be allowed to talk to your site?

kuschku · on Aug 24, 2016

> Typical users aren't usually in the habit of mashing F5, especially not for robotically long periods of time. It's basically the difference between a theoretical activity and an actual activity.

Just like every university site ever is completely down during signup days because everyone is mashing F5.

Link me your site, I’ll treat it like a college student waiting to be able to sign up for their classes.

chucksmash · on Aug 23, 2016

Have run into exactly this before. Wrote a scraper that retrieved results from a trivia league website. Tried to be a polite scraper (<1 request per second) but the site still crashed - even with 5 seconds of sleep between requests. They were doing something weird with DB connection management (maybe just forgetting to close it and letting it timeout? I remember figuring it out but it's been quite a while) and so after N very reasonably spaced queries the site would reproducibly start throwing an uncaught MAX_DB_CONNECTIONS_EXCEEDED and just be down for everybody everywhere who might've wanted to use it.

tshaddox · on Aug 23, 2016

It seems like you could easily hit those scaling issues by manually browsing the website. While I agree that it sucks to take down a site by scraping, in that specific case it sounds like the performance issues are their fault and not yours. That said, once I realized the effect my scraping had, I would (hopefully) cease my scraping.

chucksmash · on Aug 24, 2016

So the thing is, I could totally believe they never saw this traffic pattern under normal load. I'd expect bar trivia scores in a certain mid-sized US city are one of those niche things where you have a very low number of uniques but each unique then pokes around on 9 or 10 pages while they're there. The fact that the site didn't crash during normal browsing was what originally led me to speculate they were maintaining an open DB connection per session. If that was indeed the issue, I could totally imagine they'd only rarely (never?) had 100+ "concurrent..ish" unique visitors.

Grishnakh · on Aug 25, 2016

Ok, then why couldn't you revise your scraper so that it did everything in a single session, to avoid this problem?

To me, for private, personal use, a scraper should emulate a normal human browser as much as possible to avoid causing site problems and to avoid detection. If what you're doing can be done in the background, or by a cron process at some odd hour, it doesn't have to be fast at all, and you can set the timings to be similar to a normal human.

chucksmash · on Aug 23, 2016

Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.

I'd consider that a bug not a feature but I still think it's incumbent on me, the guy scraping the website, not to trigger it.

jdc0589 · on Aug 23, 2016

That is a classic connection pooling/lifecycle bug, and usually one that gets caught in the first few days of having multiple people utilizing a product/service, worst case.

If someone's production site, thats been around for while, had a bug like this that can be caused by what you describe, I'd love to see how many real users they have. I'm sure its possible under certain circumstances, but its definitely bad engineering that would be caused by literally any traffic.

cookiecaper · on Aug 23, 2016

You can avoid triggering this in your scraper by activating a cookie jar. Pretty simple most of the time. Even commandline cURL and wget support it. I'm sure you figured that out already, but just for anyone who's wondering. ;)

That said, while obviously you want to avoid triggering the bug since it offlines your data source, this is definitely in the site's court to fix and could easily be triggered by normal usage. Some people browse with cookies disabled, especially since the EU passed its "cookie law", requiring sites to get consent before storing a cookie on visitors' machines. If you've started to notice more sites talking about cookies over the last year, that's why. [0]

[0] http://ec.europa.eu/ipg/basics/legal/cookies/index_en.htm

flukus · on Aug 23, 2016

>Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.

Could also be something like storing hibernates second level cache in session. Unfortunately I've seen this, a significant chunk of the database was being copied into each users session.

bigiain · on Aug 24, 2016

I mostly agree with the post's author on the "I'm just automating something I'd otherwise be doing manually". If the local weather service publishes, say, barometric charts on their site - but has a TOS that prohibits me from scraping, and my alternative was to just hit their site every day and right-click and save-as on the chart - I feel absolutely no compunction in automating that. You need to be careful of the slippery slope though, when it's easy to grab every day's local barometric chart, it become too easy to think "Hey, I just need to stick that in a loop and I can grab 1000 different charts every day!". I'd personally _not_ do that. If it's something I'm likely to do "by hand" but would occasionally miss a day or three, I'll automate it no matter what the TOS says.

antisthenes · on Aug 23, 2016

You're saying the site should pay him a consulting fee for the free load-testing service he provides?

wwweston · on Aug 23, 2016

One fair baseline is whether or not the custom User Agent you're using to scrape has request timing that's on the order of what a fast human visitor might do. If the site can't handle that, it's certainly not the UA's fault.

sjwright · on Aug 24, 2016

> Scraping, in my opinion, isn't black hat unless you are ... stealing info.

And as a webmaster, how can I tell the difference before it's too late?

minimaxir · on Aug 23, 2016

> tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.

Analyzing data that you're not allowed to access gives you/your company a competitive advantage, which is affecting their service/business even if it's not posted/distributed publically.

Nicksil · on Aug 23, 2016

I don't follow your argument. How does one get their scraper access to data they would otherwise not be able to access through 'normal' browsing techniques?

Karawebnetwork · on Aug 23, 2016

Example I know of: You can scrape your competitor's Facebook pages since their creation and output nice graphs of which posts generated what kind of likes and suscriptions. This data is usually limited to the owner of the page.

wongarsu · on Aug 23, 2016

But I can get the same data with manual browsing. And I can just as well pay some chinese to collect the data and input it in my database.

Automated scraping is just a way to drastically reduce labor costs for information collection. Sure, it's a competitive advantage, but I think disallowing it or calling it unethical is a pretty big can of worms. Why is it ok if something is done by humans but not ok if a computer does it by himself?

Trundle · on Aug 24, 2016

If someone was to sit down and use paper + pencil + time to accomplish the same thing, would you still have issue with it? It's publicly available data. Should you also not watch your competitors television ads or walk in to their physical store and browse around?

Karawebnetwork · on Aug 24, 2016

The difference being that I can scrape the data of ALL my competitors since 2008, get the number of likes and comments of every single comment and automatically generate graphs of the data. Thousands of man hours in one minute.

This gives an unfair advantage to the tech-savvy "hackers". Facebook terms protects against this. Thus scraping it is disallowed.

I couldn't say if it would be moral or immoral to do this. Personally, I'm more concerned about the well being of poor scraper program that has to scrape through an entire decade of Facebook posts. Poor thing.

charlesdm · on Aug 23, 2016

This is called hustling. I love it. What's not to like?

ori_b · on Aug 24, 2016

By ignoring robots.txt and bypassing capchas.

angry_octet · on Aug 24, 2016

'Bad netizen stuff'? Is this a comment from 1997? Breaking captchas is 'blackhat'? What cozy hippy internet alternative reality does this come from?

These are the same websites and companies that are loading evercookies and doing browser fingerprinting, that break as much as possible the anonymity citizens should enjoy, with Real Name policies, using network analysis to find who your friends are and what your politics and buying habbits are, that routinely rip private information from you cell phone and share it with oppressive regimes.

You're not in Kansas anymore Toto.

malz · on Aug 23, 2016

> misuse of other's resources against their will

Nonsense, there is no implication that this activity is illicit. Many sites (I have worked with hundreds) are happy to be included in my service, but don't have the technical ability to provide a data feed. They were delighted when I told them I could aggregate their content without any extra work on their part.

We respect TOS, we respect robots.txt and so on. Just because you study scraping techniques doesn't mean you intend to break the law.

> Breaking captchas and the like is basically blackhat work

Um, captchas only work if they work. If breaking them is trivial, they shouldn't exist. Don't shoot the messenger for pointing out the front door is unlocked.

paulddraper · on Aug 24, 2016

"Don't shoot the messenger for pointing out the lock on your door can be picked"

drabiega · on Aug 24, 2016

"Don't shoot the messenger for pointing out that robots are capable of opening doors."

Fiahil · on Aug 23, 2016

Sometimes, scraping a website is the only way at your disposal to fetch relevant information, you've paid for (or not). It could be simply the opening schedule of your local administration or the status of different pieces of public infrastructure.

If your administration don't have the resources (and it's often the case) to maintain a proper JSON API for you to fetch with a fancy python lib, then, it's not "super bad netizen stuff" to scrap a few HTML/PDF/XLS, parse them and display them for convenient public consumption on your personal website (and paying for the bandwidth).

It's 2016. State-companies holding a third party responsible for their own outages and poor planning is _bad faith_[1]. ETL? Never heard of it?

[1]: https://citymapper.com/i/1208/soutenez-citymapper-et-lopen-d... (french)

pierrebai · on Aug 23, 2016

To know the TOS of a page, you need to read it. To know which links are part of a site and which are not, you need to follow the link. Having a TOS as part of the page content is akin to having a sign in a room, only readable by entering the room, that says "you are not allowed to enter this room".

Yes, this defense is being petty abotu details, but I find businesses using post-hoc discoverable limitations to limit people rights annoying.

Mikushi · on Aug 23, 2016

Instagram or Facebook, they thrive on stolen or relinked content and monetize it day in day out.

Being amazed at this kind of bad behaviour where the targets are some of the most despicable companies on the web is a bit ironic. Scrape away, these companies hurt the web, let's hurt them (even though, all the scraping in the world won't have any impact).

hobs · on Aug 23, 2016

So its moral to continue bad behavior because someone else did it?

cookiecaper · on Aug 23, 2016

It's not bad behavior. The companies that profit off this try to make you think it's bad behavior because they don't want to risk your taking any profit away, and they've installed laws that let them get away with this. They can violate to their heart's content, but unless someone else in the oligopoly sues over the matter (which they would never do, because the precedent may prevent their abuse of the law), the peons will be forced to comply. That's not how a competitive marketplace works, and it's why we have such a hard time breaking gridlock on web properties.

flukus · on Aug 24, 2016

> This post is kind of crazy, aggrandizing bad behavior and misuse of other's resources against their will.

How so? I send a web request, they send me the content in a response. If they aren't happy with that then they should refuse my request.

TeMPOraL · on Aug 24, 2016

I disagree. DOSing a site is bad behaviour, regardless of how you do this. But accessing it in an automated way instead of a browser? Not really. The deal on the Internet is like this: a website owner can provide whatever they want, and a visitor can read it however they want. Discriminating visitors based on whether or not they seem to be bots instead of people is going beyond what the site provider should do. So is detecting and blocking people using adblockers.

Grishnakh · on Aug 25, 2016

I agree with your assertion about "the deal on the Internet", but I disagree with your conclusions. IMO, site owners should be able to discriminate all they want. However, when someone's browser (or other software...) makes a request to that site, and the site serves them some data, the user should be able to do with that what they will: either honor or decline the requests for them to download various ads or JavaScript, for instance. It should be up to the site owner to craft their site to follow their policies and whims. What I'm completely against is the idea of using the government and law enforcement to enforce some site owner's policies. The only exception I can see for this is extreme cases where this general principle falls down: DOS attacks, for instance.

If I can modify my web browser to view a site, but skip the ads, that should be my right. If the site owner codes their site to detect this and then blocks my request to see their site, that should be their right. If I modify my ad-blocker to get around their ad-blocker-block, that should be my right, and so on. As long as we don't get into something like DDOS territory where a reasonable web site has no good technological way of avoiding the problem caused by a user, this isn't something for government to get involved in.

TeMPOraL · on Aug 25, 2016

Hmm... yeah, I guess what you (and the other commenter) describe is fairer than what I wrote. Thanks!

cookiecaper · on Aug 25, 2016

IMO all's fair as long as the solution is technical. Dragging it into the courts because you can't figure out how to stop them technically (especially if they're not actually disrupting anything) is inappropriate.

We need updated legislation that covers malicious actors that issue DDoS attacks but leaves normal people that scrape consciously and carefully alone.

madshiva · on Aug 24, 2016

So you think that people should waste their time solving captcha are the solution? people are paid to solve captcha, there's always something like that and then the users suffer more. It's not a solution at all.

twa927 · on Aug 23, 2016

How can TOS have legal power for the case scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept TOS.

Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

buro9 · on Aug 23, 2016

The UK has a database law:

https://en.wikibooks.org/wiki/UK_Database_Law#Database_Right

If you scrape, and effectively reconstitute a database, then so long as the database originally had a "substantial investment" in it's "obtaining, verifying or presenting the contents" then yup... you have breached the database right, which is a modified form of copyright.

You may access said database (via the web), but as soon as you start reconstituting the database from scraping... you're in breach.

It's a law, it is illegal in the UK, I'm sure most countries have some equivalent law on their books, all of the EU does. The law looks recent, but UK copyright and patent used to cover it, the 1997 date is just a separate statute to clarify the position.

wumpus · on Aug 23, 2016

Actually, such database laws are rare. The US and Canada don't have one. See Feist v. Rural Telephone for an example of databases getting scraped & the scraper winning in court.

kuschku · on Aug 24, 2016

Actually, the US does have one.

The World Copyright Treaty of the WIPO, which the US also signed, enforces in Article 5 that every member country has to have a database law of this kind.

    ________________________

> Article 5: Compilations of Data (Databases)

> Compilations of data or other material, in any form, which by reason of the selection or arrangement of their contents constitute intellectual creations, are protected as such. This protection does not extend to the data or the material itself and is without prejudice to any copyright subsisting in the data or material contained in the compilation.

http://www.wipo.int/wipolex/en/treaties/text.jsp?file_id=295...

> United States of America

> Signature: April 12, 1997

> Ratification: September 14, 1999

> In Force: March 6, 2002

    ________________________

In fact, this fucking treaty is the only reason so many countries even have that at all – the EU didn’t have any Database Law before it was created, and the US threatened (as always) to boycott any country not signing.

cookiecaper · on Aug 24, 2016

To be clear, this wasn't a scraper in the networked computer sense. It's actually a perfect example of how meatspace safeguards don't translate because law is not equipped to handle the nature of cyberspace.

wumpus · on Aug 24, 2016

I don't see how that's true at all. Running a meatspace telephone book through a sheet-fed scanner and OCR isn't wildly different from scraping a website.

cookiecaper · on Aug 24, 2016

It's different because you don't contact another party's server to do it. The CFAA makes it illegal to "exceed authorized access" to networked computers. "Authorized access" is whatever the server's owner says it is. That's why the copyright status of factual accumulations isn't a protection for internet scraping.

If Feist v. Rural occurred now and Rural, like most companies, kept their information in a database online, Feist would lose not for copyright infringement, but for exceeding authorized access to Rural's server.

wumpus · on Aug 24, 2016

You've made a large number of authoritative-sounding comments on this story... and this one, like many of the others, is a guess.

cookiecaper · on Aug 24, 2016

To the extent that I can't tell what would actually happen in an alternate future where Feist v. Rural occurred in the digital realm, sure. To the extent that the CFAA allows companies to make that type of determination today in the actual timeline, no, it's not a guess (and I have the wrecked business to prove it).

devbug · on Aug 24, 2016

Exactly. You can't copyright facts.

cookiecaper · on Aug 24, 2016

You can't copyright facts in the US. You effectively can in the EU, as the grandparent discussed, as long as you demonstrate that it took significant investment to arrange the compendium of facts from which they were drawn.

flukus · on Aug 23, 2016

What is the definition of "reconstituting a database"? Aren't googles indexes doing that?

cookiecaper · on Aug 24, 2016

Yes. Google is violating practically every law of this type. They're allowed to do it because they have a lot of money.

jsprogrammer · on Aug 23, 2016

You may still record the responses you receive from such a database and use it for your own purposes. The database law only restricts making a duplicate database available to the public.

mrtksn · on Aug 23, 2016

What happens if you use that data to create entirely new database? Let's say, can I create a database of people who work at Google and like ice cream by scrapping linkedIn and Facebook?

duaneb · on Aug 23, 2016

> A website is a public property.

This isn't even true metaphorically. It's like a shop front: there may be public access, but it is NOT public property.

taftster · on Aug 23, 2016

"public property" may not be the correct metaphor. But neither is "shop front" correct.

Taking the store metaphor further, it would be more like you knocking on the front door of a clothing store and the store owners open the door and throw every possible piece of clothing at you, shirts, shorts, underwear, including coupons to "partner" stores, when all you wanted was a pair of pants.

Upon knocking, if the store owner hands you instructions on how to enter their store and interact with their products in a personalized shopping experience, that would be one thing. But when the clothing owner throws everything at you at once, what they flung at you is for all practical purposes public property.

cookiecaper · on Aug 23, 2016

>How can TOS have legal power for the case scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept TOS.

This is called "clickwrap". There is usually a notice in the footer of each page that says something like "By using this site, you agree to our Terms of Service." Typically, this kind of notice has been held enforceable. More recently, judges have been demanding that such notices be placed more prominently before they're held enforceable (e.g., somewhere above the fold), but that's it.

>Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" from which to safely stand and take photographs in the internet.

Also, photographs of structures may not be free to use. Architectural copyrights went into effect in the early 90s and have a term of either 90 or 120 years. Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.

wang_li · on Aug 23, 2016

I have a custom X-TOS header in all of my http/https requests that states that the company who owns rights to the website my request is sent to and replies with data owes me:

1. Total privacy, they will not track me activity on their website, including any logs.

2. They will send me a cashier's check for $1,000 for each byte that they send to me.

3. They will provide me with Mana Sakura's cell phone number.

I'm still waiting for checks and a phone number.

cookiecaper · on Aug 23, 2016

If you can convince a judge that this represents an enforceable contract, as has been done and established with clickwrap, then you should be able to get what you're owed. :)

It is ridiculous. Something like "pagewrap" can't trump the consumer protections that apply to a physical good like a book, it would be laughed off. But the law doesn't contemplate network access so reasonably.

apricot · on Aug 23, 2016

> Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.

The architect can claim infringement all they want, they don't have a case. From https://www.law.cornell.edu/uscode/text/17/120 :

The copyright in an architectural work that has been constructed does not include the right to prevent the making, distributing, or public display of pictures, paintings, photographs, or other pictorial representations of the work, if the building in which the work is embodied is located in or ordinarily visible from a public place.

cookiecaper · on Aug 24, 2016

Thanks for pointing this out. I had run across this before, but I guess I disregarded it as I felt that for many uses, "ordinarily visible from a public place" is vague and would still result in an action where one would be forced to prove that the depicted building is ordinarily visible from a public place in court. I would like to know if a "public place" means exclusively public property or if it means private property onto which the public is welcomed, and whether a pictorial depiction is infringing if the subject is rendered as it would be seen from a "private place", whatever that is, especially if the artist or photographer had not been on that private place personally.

This is an important caveat to architectural copyright, however, so thanks for clarifying.

apricot · on Aug 24, 2016

"Public place" is indeed a vague term. In some statutes, it has a definition that includes publically accessible private property such as common areas of businesses, hospitals, etc. Does it include places where access depends on payment of a fee? How high above the ground do public places extend (thinking of photography drones)?

See also https://mentalhealthcop.wordpress.com/2013/09/20/place-to-wh... which shows the same ambiguity exists in the UK.

jessaustin · on Aug 23, 2016

The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" from which to safely stand and take photographs in the internet.

IANAL but this seems perverse. In no meaningful sense am I on corporate property when my computer in my house sends signals to another computer, formatted so that they will be re-sent in turn to a series of other computers, the last of which decides on its own based entirely on the signal it receives from the penultimate host to send a "response" to a different series of other computers, the last of which is my computer in my house.

Surely there are better ways to enforce IP restrictions than this tortured analogy of networked computing to physical location?

comex · on Aug 23, 2016

"Clickwrap" refers to situations where you have to click through before using the service, hence the name. Agreements which are simply a passive notice in a footer somewhere are called "browse-wrap", and are much less likely to be considered enforceable:

https://en.m.wikipedia.org/wiki/Browse_wrap

cookiecaper · on Aug 23, 2016

The line is blurred between clickwrap and browsewrap -- those are colloquial terms to describe ToS notices, not legal terms. Is it still browsewrap if you say "By clicking any of the links on this site, you agree to the ToS"? How far away from the clickable buttons must the statement be to be browsewrap instead of clickwrap? The distinction is really only a technicality in the wording, not anything substantive. In practice, you are still being forced to agree to a binding contract (many of which remove one's right to sue in a court of law) just by going past a landing page.

Even if we entertain a distinction between browsewrap and clickwrap, browsewrap is generally enforceable, especially after minor modifications to placement and/or font size.

gnud · on Aug 23, 2016

That notice is typically in the footer, and a screen reader will reach the nav-bar before mentioning the TOS notice.

Even for sighted people, the notice is often easy to miss - and this is by design.

esrauch · on Aug 23, 2016

It's by design because the vast majority of people don't care about that information and it makes the website worse for them to have a big ToS banner at the top of your page.

I don't think many websites have a secret ToS that they hope you won't read, I think most of them don't even know what their own ToS say. I signed my lease on a site with an explicit checkmark for ToS that said I agreed I would only use exactly IE7 to use their site.

cookiecaper · on Aug 24, 2016

Which is plenty amusing, until one of those companies is suing you in a court of law for violating said ToS.

comex · on Aug 24, 2016

Neither of us has explicitly mentioned a jurisdiction, but assuming you, like me, are referring to the United States...

I suppose I can do no better than quote from the Wikipedia page I linked:

> The Second Circuit then noted that an essential ingredient to contract formation is the mutual manifestation of assent. The court found that "a consumer's clicking on a download button does not communicate assent to contractual terms if the offer did not make clear to the consumer that clicking on the download button would signify assent to those terms."

The same page cites a number of cases where a "browsewrap" agreement was found unenforceable and only one where one was found enforceable - and the latter, for what it's worth, involved a sale taking place through the website rather than anything resembling passive browsing. Of course there exist other cases not listed; and there are situations that muddle the distinction between clickwrap and browsewrap. But at least, the very common pattern of, as you said, burying "a notice in the footer of each page" without anything vaguely resembling active consent, as practiced by probably the majority of commercial websites on the internet, seems to pretty clearly fall on the unenforceable side of the line based on those precedents.

lugg · on Aug 24, 2016

Heh, fwiw, there is another one, shrink-wrap agreements, where you can't read the agreement until you've removed the shrink-wrap but doing so means you've agreed.

sseveran · on Aug 23, 2016

One of the original court cases covering this was eBay vs Bidders Edge.

https://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge

The courts have generally disagreed with that interpretation.

dragonwriter · on Aug 23, 2016

> A website is a public property.

No, its not. It may be in public view, but that's a different issue.

makmanalp · on Aug 23, 2016

That's an interesting analogy - though you're allowed to take photographs of whatever is in public view in many jurisdictions. Now if you wanted you could take this argument to the extreme, but surely there's some parallel between sending and receiving photons across the border of someone else's property (perfectly agreeable) and sending and receiving requests?

gsnedders · on Aug 23, 2016

Both the LinkedIn and OKC cases involved the scrapers using logged in accounts.

hooph00p · on Aug 23, 2016

> A website is public property.

This is a gross misunderstanding of how the internet works.

minimaxir · on Aug 23, 2016

That analogy is not equitable. If you take photographs of a building while on the building's property, they have the right to tell you to stop, or call the police to escort you off if you refuse to do so.

tshaddox · on Aug 23, 2016

Regardless of whether that would be reasonable, is it actually true? I know that the United States has specific rules for "public accommodations," which are private properties that are generally accessible to the public, like retail businesses. Property owners in this case don't have complete control over who enters their property. The obvious example is refusal of service due to membership of a protected class like race or religion.

So I'm not so sure that police will escort you out of a Walmart because they caught you taking a picture of the parking lot with your smartphone.

ysavir · on Aug 23, 2016

Let's go with a more apt analogy:

If you're entering a country, do its laws not apply to you until you've seen a copy of them? "Oh, sorry, no one told me theft is illegal here. Where does it say that? Oh, I see. Okay. I'll stop now. Thanks for letting me know."

If you cross the border without necessary documents, does that country have no right to detain you, simply because you haven't checked the laws?

Just because a website is visible and public doesn't mean its content is public domain. It just means that your first order of business as a user should be to check the terms of service. Sure, most people using a website probably don't need to--same as not needing to check a country's stance on murder--and so can just use the website as intended without violating the terms. But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.

tshaddox · on Aug 23, 2016

Country's laws are a bit different, simply because a country has virtually absolute legal power over its territory. Countries can and do punish people for breaking laws that one cannot feasibly know they were breaking. Does any human know all the laws in the United States? Would that even be physically possible?

mentat · on Aug 23, 2016

There's some interesting science fiction opportunities here. When you open a connect to a site then all traffic over that connection is subject to the jurisdiction of the ToS for that site regardless of disclosure.

Also we don't even know how many laws there are in the United States for I'd say knowing the content is impossible.

buzzdenver · on Aug 23, 2016

That is not a good analogy. There is such a thing as reasonable expectations when visiting a website, so you do not need to read the TOS. Otherwise I could put "you own me $1000 for visiting my site" into the TOS. In other words, just clicking on a page does not constitute entering into a contract with the website. Registering and accepting the TOS does, but that still doesn't mean that anything in the TOS is enforceable.

slrz · on Aug 24, 2016

But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.

I don't need to check your terms of services if I'm doing something that I'm allowed to do by law anyway; the TOS cannot deny me those rights (they might, of course, grant me additional rights provided that I follow certain conditions).

mindslight · on Aug 23, 2016

Sure, but they do not have the right to retroactively declare you as having been trespassing, nor even to preemptively put up a "no photography" sign and have you arrested for trespassing if you disobey it.

The entire point of protocols is to precisely define the terms of communication. The status code is '200 OK', not '200 OK/Asterisk'. But of course if lawlers didn't force themselves into the situation, they'd be out of jobs.

As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.

cookiecaper · on Aug 23, 2016

>As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.

This would be copyright infringement, since the content of the page is a substantive unique work that is automatically copyrighted by its author. A site that doesn't want you scraping its content is not going to want you posting dumps of its pages. Much like BitTorrent, they'd get into the protocol and send subpoenas to the ISPs behind the IPs that serve their pages, and use that info to sue the customer.

When my company was shut down by a legal threat related to scraping, I did suggest to my lawyer that we create something like a browser extension that would grab the data we needed out of normal client-side browsing sessions. This wouldn't be as nice as controlling the flow of information ourselves but it would've worked OK. My lawyer strongly suggested avoiding that as it could've been construed as conspiratorial conduct that would've made criminal prosecution more likely.

niftich · on Aug 23, 2016

Not the discount the validity of your experience, but the usual counterpoint to this is Google, who (like mentioned elsewhere in the thread) has been continuously scraping since the very beginning and in fact built their entire business model on doing so. They are also responsible for advancing the state-of-the-art of scraping (albeit mostly internally), through the development of V8 and headless Chromium so that they can inspect dynamic pages too.

Perhaps this illustrates the fungibility of the legal system: it's an inherently human construct that pits a plaintiff against a defendant, and given a big enough warchest and persuasive-enough arguments, catastrophe can be avoided -- by Google; perhaps not by you, me, or someone else.

cookiecaper · on Aug 23, 2016

Yeah, Google violates the CFAA and infringes on copyright as a matter of course. Their service would be impossible if they weren't doing so.

The main difference when Google was small was that Google was not dependent on any data source in particular, so even if someone denied their robot or sued them, they could cease and desist without affecting the overall value of their offering. This is different if you are getting data that is only available from one or two sources.

Now, the main difference is that Google is one of the biggest companies in the world, and they'll sick an army of $1,000/hr lawyers on you if you even think about taking legal action against them. The only people who can afford to fight are other big companies, but that's not going to happen because they all depend on breaking the CFAA for their own purposes and then using their position as a huge company to bully small innovators.

mthoms · on Aug 23, 2016

Google's crawling and caching has been largely found to be fair use and thus is not considered to be infringing copyrights.

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

There are similar rulings for thumbnail images:

https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com....

And of course books:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

cookiecaper · on Aug 23, 2016

Incidentally, this only further proves my point. If you're a big company that's retained massive law firms, you can successfully raise a fair use and implied license defense. If you're not, you can neither mount a strong offense against that defense nor mount a strong defense against Google's hypocritical offense if you find yourself on the other side.

Google's primary out here is its reputation (not guarantee) for obeying robots.txt. If Google indexed a page that disallowed it in robots.txt, the case would be much stronger. There's also the unofficial out, which is that judges think Google is a cool large company, so they rule in their favor based on their personal biases.

Fair use is a case-by-case basis, so you can't say that Google's infringing conduct is generally accepted to be fair use. The EFF had to take on Universal in Lenz v. Universal Music Group, and that went up to the Supreme Court. That's how individuals are left to assert their fair use rights.

mthoms · on Aug 24, 2016

You claimed "[Google] infringes on copyright as a matter of course" despite the many real civil cases (previously cited) which have found these very activities to be non-infringing. And then, strangely you claimed:

>Fair use is a case-by-case basis, so you can't say that Google's infringing conduct is generally accepted to be fair use.

There is so much wrong with this statement. For one, how can you call something infringing at the same time you point out that nothing has been proven? That simply defies all common logic.

Secondly, in general terms, the activities in question have been found to be non-infringing by the courts. Sure fair-use is case-by-case but if you're operating within similar parameters as a previously litigated case, then the legal risk is immensely reduced.

I don't disagree with your assertion that the legal system greatly favours the well monied/connected (I don't think anyone would). But you can't claim it to be fact that Google Search is infringing anything with little to no evidence or rulings to cite. Unless you're just stating an opinion in which case you should clearly indicate that.

cookiecaper · on Aug 24, 2016

First, IANAL, so my use of some terms may be loose. I never intend to convey more than an informed layman's opinion. However, I do love it when I'm corrected so that my usage can improve.

Fair use is an affirmative defense. Google admits that it copies content without legal license to do so, but claims that said copies are non-infringing under fair use exemptions. I guess you're probably correct that it's no longer appropriate to refer to Google's behavior specifically as "infringing", just "copying without authorization", which, for those of us without $5 million to commit to a legal team, means "infringing". I will try to remember the special standard of law which has been allowed to Google and refer to their copying only as "unauthorized" and not "infringing" in the future.

If you review the points summarized in the Wikipedia articles you helpfully linked, you'll see that Google's defense is mostly "Yeah, but we're Google".

In Field, "the court found that the plaintiff had granted Google an implied, nonexclusive license to display the work because of Field’s failure in using meta tags to prevent his site from being cached by Google.", i.e., because Field already knew Google existed and knew there was a standard way to prevent its access but chose not to employ it, he gave Google an implied license.

Who else does that work for? Can I send an email to Netflix and tell them "Hey, if you don't want me to copy your shows, please add this in your page's HEAD element: <meta name='please-dont-download-my-shows-sir'>"? No?

I understand there are other criteria which were used to decide if Google's use was specifically infringing in addition to the implied license. Just demonstrating that Google is getting favored treatment from the judiciary that would not be available to a normal entity.

In Perfect 10 [0], the judge even explicitly indicated that he was loathe to find Google's use of thumbnails infringing because he didn't want to "impede the advance of internet technology", but that he felt the law obligated him to do so (his ruling in that matter was overturned on appeal, when the Ninth Circuit found Google's usage non-infringing). What if the defendant had been some company perceived as less technically advanced than Google? This is probably as close as you can get to an explicit statement of favoritism. The Ninth Circuit also rejected Perfect 10's claim that RAM copies were infringing (which was not the case with an unlucky non-Google company discussed further down).

What if I started indexing and rehosting thumbnails? I can assure you that I would get C&D'd almost immediately and I would be forced to shut down because I can't afford to pay lawyers for 3 years while the case works through the system (and to be honest, I'm surprised it only took 3 years). And even if I could, with a reputation less sterling than Google's, there's no reason to believe that a judge would rule in the favor of one useless guy instead of a big company. A judge would look at the case and say "Google's use was fair because it provided a public service [actually cited as part of the justification in most of your linked cases], but this guy is just using it for a few hundred people, it's definitely unfair, he owes that company more money than he'll make in his life, case dismissed".

There are many such cases on the books. I don't know if Google has a direct connection to the reptilian overlords or what, but it seems in most cases where they're not involved, the good side loses.

In Craigslist v. 3Taps, while primarily a CFAA case, 3Taps was found to be infringing copyrights by sampling Craigslist postings in order to allow its clients to plot them on a map. Being a "public service" or a "referential use" didn't matter for them. They were raked over the coals, and it's been that way with most cases.

In Ticketmaster v. RMG Technologies [1], RMG was found to infringe just by parsing a page. "Defendant's direct liability for copyright infringement is based on the automatically-created copies of ticketmaster.com webpages that are stored on Defendant's computer each time Defendant accesses ticketmaster.com. [...] Defendant contends [...] that such copies could not give rise to copyright liability because their creation constitutes fair use[.] [...] Defendant's fair use defense fails."

The case specifically discusses how, despite the precedent in Perfect 10, since the Defendant is not Google, it is bound by a site's Terms of Use and copyright law, and RAM copies, which are specifically non-infringing for Google, were infringing for RMG.

Very similar findings were made in Facebook v. Power Ventures, and the founder was left holding a bag of $3 million in personal liability.

This is a thread about the legality of HN users scraping. It seems Google is the only entity capable of making unauthorized copies and then getting courts to agree that it's fair use. For the rest of us, it's infringement, which carries stiff penalties (and this doesn't even broach the CFAA portion of the issue).

So when I say "infringing", I mean something that would be considered infringing if you aren't Google. It's apparently only infringement if the judges involved don't personally use your site and don't have to worry about personally suffering the consequences of not having access to it. :)

[0] https://www.eff.org/document/perfect-10-v-google-ninth-circu...

[1] https://scholar.google.com/scholar_case?case=147697505884223...

mthoms · on Aug 24, 2016

What you've failed to mention is the criteria used to determine if a usage is indeed "fair". There are 4 basic criteria[0] but can be summarized as "If the usage doesn't affect the market for the original work, is substantially transformative, is proportionally insignificant or is used for critique/parody then it is fair". Or, at the risk of over simplifying it: "Does the usage grant a net public benefit without significantly hurting the copyright holders ability to make money?".

>Can I send an email to Netflix and tell them "Hey, if you don't want me to copy your shows, please add this in your page's HEAD element: <meta name='please-dont-download-my-shows-sir'>"?

Actually, under fair use you certainly can make a personal copy (see Betamax case). If you distribute the work you would likely run afoul of the criteria summarized above.

The robots.txt relevancy is being over stated in your argument. The main criteria used in this case is summarized above. The fact that Google provides an opt-out mechanism is a secondary, supporting argument.

>What if I started indexing and rehosting thumbnails? I can assure you that I would get C&D'd almost immediately

A determination of infringement would depend entirely on the context as related to the afore mentioned criteria. The fact that someone might try to sue is a product of the terrible system in general and you're absolutely right - as with any legal matter the entity with the deeper pockets can often bully the other guy into submission.

>In Craigslist v. 3Taps, while primarily a CFAA case, 3Taps was found to be infringing copyrights

My understanding is that the copyright part of the case was thrown out [1] and thus was settled solely around CFAA matters.

>In Ticketmaster v. RMG Technologies , RMG was found to infringe just by parsing a page.

I agree that the logic used for the judgement is absurd (for reasons that are plainly obvious to any HN user). But it's less clear whether the case would meet fair use criteria outlined above should it have come to that. My guess is that it wouldn't qualify since the usage affects the copyright holders ability to make money on the work and doesn't meet any of the other criteria for Fair Use.

>Facebook v. Power Ventures

This is not a case involving a defense of fair use (as far as I can tell). Facebook even acknowledged the users owned the data and had a right to it. The defendant was actually found to be violating CFAA and CAN-SPAM acts.

>It seems Google is the only entity capable of making unauthorized copies and then getting courts to agree that it's fair use. For the rest of us, it's infringement

Provably false [2]. It sounds like perhaps your personal experience has soured your opinion on the matter? That's understandable. But none of the evidence you've cited supports the argument that Google is infringing copyrights in its core activities nor that Google is the only entity where copyright laws and fair use legislation don't apply.

PS: To be clear, my argument revolves specifically around copyright infringement and fair use. I don't have enough understanding of other, separate legislation like CFAA to comment on that except to say that it seems overly broad and unrealistic. But that's another topic. I'm specifically arguing against calling Google a copyright infringer in a broad sense which is what you've done. That's not been proven.

[0] https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors [1] https://techcrunch.com/2013/04/30/craigslist-3taps-lawsuit-d... [2] http://fairuse.stanford.edu/overview/fair-use/cases/

cookiecaper · on Aug 25, 2016

>What you've failed to mention is the criteria used to determine if a usage is indeed "fair".

Yes, I understand that the criteria for fair use is defined in the statute. What I'm saying is that like most things brought before judges, arguments can be made either way, and judges seemingly favor Google but not smaller defendants. Thus, while the RAM copies of web pages made by Google are fair use, those made by RMG aren't.

If you look at the Ninth Circuit's ruling in Perfect 10, the length they stretch to reverse the District Court's finding of thumbnails as infringing is ridiculous. It's pretty clear that thumbnails are direct infringements and that you don't invalidate the copyright or create a truly "transformative use" by making it smaller and adding it to an index. Perfect 10 was certainly of this opinion, and I'm sure they saw a real impact to their revenue.

Over the years I've learned that no position is too high to disregard the human factor. 99% of the time people are going to act primarily to their own benefit and work backwards to find rational (or rational-sounding) arguments to justify it. Judges are politicians and they're very image-conscious. None of them wants to be the one to make Google Image Search useless.

You seem to be saying that since Google's use was found non-infringing in these cases, its use is objectively non-infringing. I don't agree with this. Rather, I think that Google's conduct is a pretty plain violation of the relevant statute(s) and that most of it is not covered under fair use, the way the laws are currently written. I think that judges apply the statute in full force when smaller defendants present, but that they have a bias for Google (which is really a bias for themselves, since they know that serious backlash awaits the judge who puts the kabosh on it) that causes them to contort the law pretty heavily so that they can rule the way they want to.

>Actually, under fair use you certainly can make a personal copy (see Betamax case).

See, we were on the right track before we got into networks. Since then, the rulings have been pretty darn bad. The modern "Betamax case" may well have been American Broadcasting Cos. v. Aereo, Inc. [0], and it wasn't a win for us.

Note also that separate from the copyright concern, the DMCA makes it illegal to circumvent a copy protection device (or indeed, even to teach another how to do so). Since Netflix employs DRM, even if there is a fair-use right to a copy of a Netflix program (which is by no means certain), you'd probably have to break the DMCA to obtain it.

>The robots.txt relevancy is being over stated in your argument. The main criteria used in this case is summarized above. The fact that Google provides an opt-out mechanism is a secondary, supporting argument.

I disagree. Google has been able to discharge all CFAA claims because the judges have said "Well, you knew there was a way to stop it." If that's the logic, I'll happily inform the parties I may scrape that there's a way to stop it.

>A determination of infringement would depend entirely on the context as related to the afore mentioned criteria.

Yes, I understand that the judge would write a report that appeared to consider the relevant criteria. The real question is, would that judge be willing to make the same logical contortions that other judges have made for Google?

I think that he would just go in favor of his biases, and right now we have a judiciary that is heavily biased against the little guy from the start, and this is only exacerbated by an inability to retain hotshot lawyers.

>My understanding is that the copyright part of the case was thrown out and thus was settled solely around CFAA matters.

The only portion of the copyright claim that was dismissed was Craigslist's claim that it owned an exclusive license in the scraped content. This was based on a short-lived ToU update that was specifically intended to strengthen Cragislist's case in this instance. The remaining copyright-related claims were allowed to stand, including a claim that Padmapper had violated a copyright Craigslist holds on the collection of advertisements (rather than on the advertisements themselves). [1]

>[re: RMG] I agree that the logic used for the judgement is absurd (for reasons that are plainly obvious to any HN user).

If you agree the logic was absurd, you agree that a copy of the page that exists in RAM for microseconds does not qualify as a protected copy any more than the reflection of an image on one's retina qualifies. As a "copy" that should be ineligible for copy protection, it doesn't matter if it qualifies for fair use (and I don't necessarily agree that it wouldn't).

> [re: Facebook v. Power] This is not a case involving a defense of fair use (as far as I can tell).

Correct. I was including it because it's an example of Google getting another free pass for stuff that shuts others down, which is the CFAA. CFAA claims are raised against Google in at least Field and Perfect 10, and they get dismissed based on the judge's assumption that the plaintiff knows about the special steps Google makes you take to stop them from violating the CFAA, the absurdity of which we've already discussed.

My wording that the "findings were very similar" was definitely bad since a different law was in play. I meant they were very similar in nature, not in fact. That said, it's likely the only reason that the cached pages weren't considered infringement is that Facebook didn't bring it up.

>But none of the evidence you've cited supports the argument that Google is infringing copyrights in its core activities nor that Google is the only entity where copyright laws and fair use legislation don't apply.

Again, I'm discussing this from a practical position, not one that is strictly compliant with legal theory, where judges always enforce the law with perfect equity, and in which anything a judge (or jury) finds becomes Official Truth de-facto.

From a textbook perspective, sure, everyone has all the same rights and the legal system is always applied equitably. I simply don't believe that has borne out in practice when it comes to internet-centric companies that aren't household names.

It seems that the things Google does are considered infringement when other people do them. Thus, it behooves to know the actual law and follow it, even if Google gets a free pass, since we can't rely the judiciary to interpret the law favorably for us.

RMG is a great example because it occurred after Perfect 10, and the same argument against RAM copies was raised in both cases. It's apparently fair use if Google scrapes your page to download and rehost all of your images, but it's not fair use to read out non-copyrightable factual data unobtainable from any other source (like ticket prices and event times) and rehost it nowhere. Sure.

The alternate lesson here is to focus on getting really big and powerful really quickly, and making sure you cultivate a good public image, so that judges are afraid to rule against you in ways that would affect a product offering upon which millions of people depend. That seems to have worked for most big internet companies, actually. Definitely worked for Facebook and Google.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos._v._....

[1] http://www.dmlp.org/sites/dmlp.org/files/2013-04-30-Order%20... pgs. 9-16

wtracy · on Aug 23, 2016

> they'll sick an army of $1,000/hr lawyers on you

They don't even need to do that. They just cheerfully agree to not scrape you, and wait for you to come back and beg to be re-instated when your search traffic plummets.

SilasX · on Aug 23, 2016

Isn't that what webarchive/wayback machine do? I think they use a "Fair Use" defense.

mindslight · on Aug 23, 2016

Oh for sure. But BitTorrent is still around and works great!

Asooka · on Aug 23, 2016

Well, it's just a technical response code. 200 OK - everything went as normal, here's your data. By the same margin, the door on a shop doesn't stop you walking out without paying and the road markings don't stop you from driving in the wrong lane.

I think imbuing technical protocols with legal implications would be even worse than the current situation since then changing anything on a protocol would require changing the law and getting a protocol implementation slightly wrong would carry real-world legal repercussions on the order of licensing your work in the public domain rather than retaining copyright. Let the lawyers make the law and check the human terms of service before using the data. Trying to out-lawyer the lawyers is like challenging a hedgehog to a butt-kicking brawl.

Doctor_Fegg · on Aug 23, 2016

The protocol is also that you send a valid, non-faked User-Agent:

"The User-Agent request-header field contains information about the user agent originating the request. This is for [...] the tracing of protocol violations [...]. User agents SHOULD include this field with requests"

Many scrapers disregard this part of the protocol. Of course, whether a headless browser should send a different UA is an interesting question.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

TheCoelacanth · on Aug 24, 2016

User-Agent is a SHOULD, not a MUST. There are also practically no browsers that send a non-fake User-Agent, since they almost all claim to be Mozilla/5.0.

jessaustin · on Aug 23, 2016

One rarely visits corporate property in order to access corporate websites. The analogy may be flawed, but this objection to it is as well.

pyre · on Aug 24, 2016

A better metaphor for this would be the "Sunday Flyers" that come in the newspaper (e.g. for big box stores like Best Buy). They sent that information to you, they can not then attempt to restrict how you use that information (though they have tried to claim copyright over pricing against sites that aggregate the flyers).

cookiecaper · on Aug 23, 2016

Yes, it's important to understand that in the United States, web scraping is usually an illegal activity under the CFAA. If you draw enough attention, your scrape target will notice and threaten you, and probably follow through with the suit. Since the CFAA prescribes both civil and criminal penalties, you may even find yourself in jail for accessing data without the company's approval. Aaron Swartz was being prosecuted under these provisions for scraping public domain data.

The CFAA is a really bad law and creates the network effect lock-in that we all considered a natural part of the web. It doesn't have to be that way -- users should be free to use any browsing appliance they want, including so-called "scrapers".

Big companies like Google not only got their start by flagrantly violating the CFAA, copyright, and privacy laws, but they continue to do so. The moral of the story is hurry up and get big before you get sued or arrested.

There's a long history of ridiculous web scraping rulings based on technical misunderstandings by neophyte judges, including Ticketmaster v. RMG, where infringement was found because the company scraped data out of a page with the Ticketmaster logo on it.

Facebook sued a company called Power Ventures which read out only the user's own data. The founder was found personally liable for $3 million in damages. Facebook did this because they don't want it to be easy for their users to move between social media services. If it's easy, Facebook has to compete on merit instead of just keeping switching costs high. Facebook doesn't like that, so they sue people who make it possible -- and the law says they should win.

We badly need a revised law, but the powers-that-be will strongly oppose it because it would threaten their monopoly over web properties. They continue to flaunt their strategic ignorance of these laws and then take shelter behind them to stop risk from small innovators (i.e., having to compete fair and square).

In the real world, we have a lot of laws that mostly prevent this kind of bad behavior. In cyberspace, the structure is such that most of those laws are not applicable. We need to update and port the pro-small-business logic we have for meatspace companies so that it counts online too. The state of affairs online is really bad.

I want to get a law called the "Consumer Data Freedom Act" passed, which would allow users to access any web property with any non-disruptive browsing device, including custom scrapers that don't impose much more load than a typical user browsing session would.

0xdeadbeefbabe · on Aug 23, 2016

Yes laws are neat and a reason for attending law school I suppose. I'm of the simpleton opinion that TCP/IP and the other protocols are the law of the net, and you ought to start with those.

travmatt · on Aug 23, 2016

One of my favorite scenes from 'Blow':

Judge: George Jung, you stand accused of possession of six hundred and sixty pounds of marijuana with intent to distribute. How do you plead? George: Your honor, I'd like to say a few words to the court if I may. Judge: Well, you're gonna have to stop slouching and stand up to address this court, sir. George: [stands] Alright. Well, in all honesty, I don't feel that what I've done is a crime. And I think it's illogical and irresponsible for you to sentence me to prison. Because, when you think about it, what did I really do? I crossed an imaginary line with a bunch of plants. I mean, you say I'm an outlaw, you say I'm a thief, but where's the Christmas dinner for the people on relief? Huh? You say you're looking for someone who's never weak but always strong, to gather flowers constantly whether you are right or wrong, someone to open each and every door, but it ain't me, babe, huh? No, no, no, it ain't me, babe. It ain't me you're looking for, babe. You follow? Judge: Yeah... Gosh, you know, your concepts are really interesting, Mister Jung. George: Thank you. Judge: Unfortunately for you, the line you crossed was real and the plants you brought with you were illegal, so your bail is twenty thousand dollars.

tedunangst · on Aug 23, 2016

Excellent news. I'm of the opinion that might makes right is the law of the land, and I'm going to start by buying a bigger gun.

0xdeadbeefbabe · on Aug 24, 2016

If you were really of that opinion I think you would hide instead.

kuschku · on Aug 23, 2016

Or you just move to a locale where scraping is legal, and any contractual terms saying otherwise are null and void.

I’d assume a lot of HN users are from such locales.

We don’t always have to assume US laws apply globally – they don’t.

cookiecaper · on Aug 23, 2016

I was actually searching for such a jurisdiction as my startup was shut down by a company that invoked the CFAA late last year. What do you suggest? The EU is even worse than the US when it comes to data freedom and tech access. The law on the books in many former British colonies provides marginally more protection (the "Telecommunications Act"), but it'd probably still be disputable, and you'd be shut down anyway unless you had millions sitting around with your lawyers' name on it.

detaro · on Aug 23, 2016

It depends on what you are doing. The CFAA is very far-reaching, but of course many aspects were the US answer is "CFAA" are covered by other laws. [EDIT: removed outdated information superseded by european decisions, which make the situation a lot less clear]

anti-scraping: If somebody were to offer a telephone book database online and you created a copy of that to sell on your own, you'd almost certainly loose in the EU (since unlike in the US, databases as pure collections of facts have their own copyright protections)

The legally safest locations probably are outside the western world if you are targeting western sites.

cookiecaper · on Aug 23, 2016

>Pro-scraping: Last big case I remember here was a flight-search site that did flight search and booking(!) via a scraper and Ryanair lost when they tried to sue them for that, since they couldn't argue convincingly how that was damaging them.

Every case I've seen wrt Ryanair (they sue a lot of people) has resulted in a win for Ryanair. Do you have details on the case you're describing?

>anti-scraping: [...]

Scraping purely factual data is one of my points of defense in the US. I don't want to give it away.

>It's still risky though, the safest locations probably are outside the western world if you are targeting western sites.

Yeah, this was ultimately the conclusion I had to come to. However, outside the West, the Western companies will just send someone with a briefcase full of $100 bills and pay them off. Corrupt government officials in these locations want the goodwill of a big American company a lot more than they care about any particular random guy.

There is only one workable solution: run the service totally anonymously and maintain good opsec so that your cover isn't blown. All under the table. This has its own issues, like making it difficult to receive payment and putting one at much greater legal risk than a mere CFAA dispute, but it's the only option if you don't plan to get shut down.

detaro · on Aug 23, 2016

I was refering to a BGH decision (30.04.2014 (Az. I ZR 224/12)), but it seems like newer decisions from european courts kill that argument :/

I edited my original comment to reflect that.

charlesdm · on Aug 23, 2016

Which site were you running?

niftich · on Aug 23, 2016

Antigua and Barbuda? See https://en.wikipedia.org/wiki/Antigua#Internet_hosting_and_g...

charlesdm · on Aug 23, 2016

Dubai?

madamelic · on Aug 23, 2016

Scraping being illegal is as dumb as saying it's illegal to take photos in public. You aren't affecting anyone if you do it respectfully.

elmigranto · on Aug 23, 2016

Being dumb doesn't prevent laws from existing (we all know those "funny US law stuff" in the realms of "no kissing toy camels on the cheek (but the mouths are okay)".

Also, since this is somewhat untouched territory, don't be so sure that you'll get a judge who is as well-versed in web scarping and infrastructure as you, or shares your opinions on the subject. (And given that precedents are so important in US laws, you better hope someone else before you didn't get such a judge.)

downandout · on Aug 23, 2016

Obviously it's a good idea to follow TOS. But as a practical matter, they have to know that you're doing it before they can take action. You wouldn't want to put up a site announcing that you're selling scraped LinkedIn data, for example. But if that data is valuable to your business - collecting names of people that work in certain positions at certain companies so that you can do targeted snail mail campaigns for example - you could quietly scrape and use the data without issue. Use proxies and prosper.

cookiecaper · on Aug 23, 2016

This only goes so far, and if you get found out, you're looking at willful infringement (which usually triples damages) and probably criminal charges under the CFAA. However, it should be acknowledged that there are many people making quiet livings off scrapes that are not legal. There are even a few companies making loud livings off such scrapes, like Google.

If you're not going to run it totally anonymously, you should be prepared to jettison and repackage it when you get found it (so that you appear to be complying with the C&D).

Scraping is a huge part of the web, and everyone does it. It sucks that it has to live underground because only big companies can duke it out in court.

downandout · on Aug 23, 2016

TOS violations have been found to not be subject to the criminal provisions of the CFAA. The only circumstance under which it would become a criminal issue is if they successfully sued you for it, and obtained a judgment that included a provision ordering you to cease scraping. Ignoring such a court order would then potentially expose you to a criminal contempt of court action.

siegecraft · on Aug 23, 2016

I think the lessons learned from those lawsuits was to always have some sort of 3rd-party intermediary scraping consultancy firm you engage that is totally not just your business under another name.

kbenson · on Aug 23, 2016

> (Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

I played with the idea of creating some social aggregation type service with some friends (as a business). The more I read about FB's past behavior with regard to this, and how essential they are to any sort of service like, that, I canned the project. Regardless of what their TOS say, if you get on their radar and they send you a cease-and-desist, it's game over. Facebook is not in the business of subverting their revenue stream, so if you are making money off them and it's preventing them from capitalizing on their users, don't expect to last long if you exist by their grace.

Really, there's an interesting space between so small nobody cares and large enough that getting shut down is a real problem. A lot of projects start small and end up (relatively) large, but without a good way to pay for the service itself. While not every service needs to be a business and make money, once you reach the level where you risk either being shut out of your data source or you need to somehow work out an understanding with that source, how do you approach that when being able to pay is off the table? Not to mention the problem approaching before you have to and forcing the situation, or waiting too long and risking the wrath of the source because you've abused their service as long as you have. Has anyone else been in this situation and found an approach that works?

headmelted · on Aug 23, 2016

One thing I'm not quite clear on here.

I understand the use of ToS clauses to prevent scraping but I do kind of wonder to what extent they have authority here.

IANAL, but surely this would fall under copyright law? While re-publishing copyright-protected data without consent is probably unlawful in your region (like scraping an art site and re-posting the images), I wouldn't think just scraping data points for a different purpose (like scraping amazon for the purposes of price comparison) is nearly so clear cut (or enforceable), but maybe I'm just naive.

cookiecaper · on Aug 23, 2016

The content falls under copyright law. The problem is that you have to enter the company's servers to obtain this data, and the CFAA says that the company can treat their public-facing web servers like private property, and if you're caught "trespassing", you can be sued and jailed. Scraping plaintiffs are usually granted an injunction based on "trespass to chattels" (among other rationales), i.e., trespass to an individual's property (as opposed to land).

Companies like PriceZombie are forced to stop because the CFAA says that Amazon can prevent them from accessing their servers by decree alone. A ToS isn't even really necessary for this, but it helps them pin down their argument.

PriceZombie could try to get the data from third-party caches, but it only solves part of the problem, because copyright and trademarks come back into the picture once you have a replica of the target page. In Ticketmaster v. RMG Technologies, the judge found RMG infringing on Ticketmaster's trademarks and copyrights because the page they were scraping included Ticketmaster's logo. The judge said the copy of the full page that existed momentarily in RAM while the scraper extracted the non-copyrightable data constituted a copy that infringed on Ticketmaster's rights, even though the logo was never used by the application in any way, it just happened to be on the page.

ChuckMcM · on Aug 23, 2016

I was going to post something similar. When you go to all that trouble that the web site owner is pretty clearly trying to prevent, that is convincing evidence that you are breaking the terms of service. And breaking the terms of service for a web site has been held to be a civil violation (a number of times on Ebay and Amazon) and potentially a CFAA violation by the Justice department.

downandout · on Aug 23, 2016

Actually it's been held that TOS violations are NOT subject to the criminal provisions of the CFAA.

ChuckMcM · on Aug 23, 2016

Are you referring to the MySpace case? Or the July 2016 decision by the Ninth Circuit (https://cdn.ca9.uscourts.gov/datastore/opinions/2016/07/05/1...). In US v Nosal it seems like they come down in favor of a CFAA violation if the user acts in an unauthorized way. The author of the piece talks about bypassing captcha's which are, in one interpretation, a demand for authorization (by proving that you are a human and not a program) and by circumventing that authorization they have stepped quite clearly into CFAA territory.

If you were referring to a different decision I'd love to read it. I follow this stuff (and at one time explored what legal action our startup could take against scrapers). In our case we also offered a paid API so it was fairly easy to establish damages.

downandout · on Aug 24, 2016

The case you're referring to is an entirely different set of circumstances. From the text:

"The panel held that the defendant, a former employee whose computer access credentials were revoked, acted “without authorization” in violation of the CFAA when he or his former employee co-conspirators used the login credentials of a current employee to gain access to computer data owned by the former employer and to circumvent the revocation of access. "

I think that case is unambiguous - this guy was using someone else's credentials to access secured systems after having been explicitly told that he could not. I was referring to the MySpace case.

I don't think these two cases are in conflict; IMO they are very different. Additionally, for our purposes in this comment thread, we're talking about scraping of publicly available websites by outside parties, not by former employees whose access has been explicitly revoked. That is different than either of these cases.

yeowMeng · on Aug 23, 2016

I am no expert, but I always thought you could scrape without consequence provided you never distribute your scrapings?