Web Scraping in 2016 (franciskim.co)
852 points by franciskim on Aug 23, 2016 | 387 comments



Keep in mind that companies have sued over scraping done outside the API; LinkedIn, for example, explicitly prohibits scraping in its ToS: http://www.informationweek.com/software/social/linkedin-sues...

OKCupid did a DMCA takedown for researchers releasing scraped data: https://www.engadget.com/2016/05/17/publicly-released-okcupi...

Since both of these incidents, I now only scrape a) through the API, following rate limits, or b) if there is no API and the data is explicitly meant to be shared publicly (e.g. blogs), in which case I follow robots.txt. Of course, most companies have a do-not-scrape clause in their ToS anyway, to my personal frustration.

(Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)
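For anyone curious, the robots.txt check takes only a few lines; here's a minimal sketch using Python's standard library (the user-agent string and URLs are placeholders, not anything from my actual scraper):

    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "my-side-project-bot/0.1"          # placeholder; use something identifiable
    PAGE = "https://example.com/blog/some-post"     # hypothetical page to fetch

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                                       # fetch and parse robots.txt

    if rp.can_fetch(USER_AGENT, PAGE):
        req = urllib.request.Request(PAGE, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            html = resp.read()
        time.sleep(rp.crawl_delay(USER_AGENT) or 1) # honour Crawl-delay if the site declares one
    else:
        print("robots.txt disallows this path; skipping")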


On the ethics side, I don't scrape large amounts of data, e.g. giving clients lead gen (x leads for y dollars). In fact, I have never done a scraping job and don't intend to do those jobs for profit.

For me it's purely for personal use and my little side projects. I don't even like the word scraping because it comes loaded with so many negative connotations (which sparked this whole comment thread), and for good reason: it reflects the demand in the market. People want cheap leads to spam, and that's a bad use of technology.

Generally I tend to focus more on words and phrases like 'automation' and 'scripting a bot'. I'm just automating my life; I'm writing a bot to replace what I would otherwise have to do on a daily basis, like looking on Facebook for some gifs and videos and then manually posting them to my site. Would I spend an hour each and every day doing this? No, I'm much lazier than that.

Who is anyone to tell me what I can and can't automate in my life?


This is exactly my response to "you can't legally scrape my site because of TOS." I don't think anyone has a legal right to tell me HOW I use their service. Making "browsing my website using a script you wrote yourself" illegal is akin to "You cannot use the tab key to tab between fields on my website, you must only use the touchpad to move the cursor over each field individually."

It's baloney.


Under the CFAA, they do have the right to determine what constitutes authorized access. If they say you're unauthorized for using the wrong buttons on your mouse, then you're unauthorized. It's treated very similarly to trespass on private land.

You can try telling the judge it's baloney, but if he's going by current precedent, he probably won't agree with you.


It is not the same flavor of violation as trespassing on land; it's more like riding a bicycle on a sidewalk. Fortunately, I'm not in the business of scraping sites, but I still find this legal precedent abhorrent, and I hope it gets struck down in court when push comes to shove. I would certainly vote that way if given the chance.


It's like riding a bike in a skate-park with a small sign saying "No bikes".


It's like riding a red mountain bike on a bike path with a small sign saying "only black road bikes allowed".


> Who is anyone to tell me what I can and can't automate in my life?

You are exactly right. But although a site can deny you access for any arbitrary reason (it's their website, after all), apparently the government thinks it's the one to enforce this crap.

What if the ToS says you can only access the site while jumping through hoops? You only read the ToS after a while, and you weren't hopping? Well, too bad, now you are being sued for reading the main page _and_ the ToS page without jumping around.

This comment's Terms of Service: if you read any of this text, you owe lerpa $1,000,000, to be paid by 09/01/2016.


It would be OK if it wasn't "You can't scrape my site. Unless of course you're Google." This double standard drives me mad.


As the owner of a large website, I don't care what you think. I block by default and whitelist when I decide it's in my interest.

If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.

Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.


As the user of large websites, I don't care. I'm not going to read the ToS, and I will continue to scrape what I like, since it makes my life more convenient. Like OP, when blocked I'll just drive my scraping through a web browser, which is what I've done for years on various sites that never provided APIs.


"As the user of large websites I don't care". Are you sure ? Do you want your OK Cupid or LinkedIn profile to be crossposted on another website without your knowledge.


If you don't want your data public, then don't make it public in the first place. That's a good rule of thumb.


Putting it behind a signup page with terms that don't allow sharing is not "making it public".

And while in the US that may "just" be treated as unauthorized access, in the EU, if you make the data public it's also a violation of the Data Protection Directive, putting you at risk of prosecution in every EU country from which you have included data.

You may be right from a risk minimisation perspective. But for a lot of data the risk in the case of exposure is low enough that it is a totally valid risk management strategy to assume that legal protections will be a sufficient deterrent to prevent enough of the most blatant abuses.


Eh, not really. The Data Protection Directive doesn’t even apply here – if the first party (OKCupid) made it available to a third party (the scraper), then the first party can be held in violation, but not the third party.


If you have control of personally identifiable data, it's likely that at least some of the EU data protection rules will apply to you regardless of how you got it.


Yes but as you say, they apply regardless. More specifically, they apply to data that you have (and are storing), not the act of obtaining it.

As a private individual it's not hard to comply either, for private use. If you publish it, it becomes a different story, because it's PII. And, as soon as it's in possession of a company, they need to comply with more rules about securely storing it, etc. (this isn't enforced very well, though). Private individuals can't be held to that because there's (in theory) no legal way to check it.


Why is it a double standard? Google scraping usually benefits the site with increased traffic and revenue, in a way most other scraping does not. Saying "you can scrape me if it benefits me" isn't totally in keeping with the principles of the open web, but it's not hypocritical.


At the risk of stating the obvious, this is a double standard simply because there are two standards: one for Google and one for everyone else. I can't speak for the poster you were replying to, but whilst I see it as logical, self-interested behaviour by site owners, it still feels unfair.


There isn't: the function for this standard includes expected benefit as an input. Every standard has inputs, so that certainly isn't the quality for making something a double standard. The only remaining quality is how unfair it feels, so it would probably be better to just address that, since it is obviously the only thing you disagree about.


With that logic decreased wages for women are not a double standard due to the potential for maternity leave affecting their output at work.

This is a double standard plain and simple, and a very dangerous one at that.


Your example is a case of discrimination, but the economic rationale is unquestionable. There is a tremendous upfront cost for new employees, who are not valuable contributors for some lengthy ramp up period and furthermore accrue experience over the course of employment. So the lifetime value curve for any given employee is typically skewed left.


I think I'm just being difficult.

My point was that when you call something a double standard, you're arguing two things of equal value have been judged differently under the same standard. But by acknowledging they've been judged differently, you're acknowledging that there is a judgement, a standard, that applies the same to both, and produces the results you object to. What you really object to is the fairness of the qualities checked by the standard.

Since the outcome of calling things that, vs calling them a double standard is the same, I think most people already know and have no trouble with this. My protests were worthless.

It could gain value if there were certain whitelisted judgable aspects (like expected value), and judgements that aren't based on things from the whitelist are considered outside the scope of a standard. Then, calling the standard unfair and calling it a double standard would have a different meaning (if only in some contrived way, since any aspect is just an argument away from the whitelist)


It's their site; they can block whatever they want. The problem is the stupid, far-reaching conclusion that this is trespassing.

Even normal trespassing laws are way too overreaching (see how it is handled in the UK for a saner example), but now you have the amazing possibility of remote trespassing.

The fun part is that it's just a matter of someone hiding a notice that says you cannot access the site in a place you have to access the site in order to read -- the ToS. Suing people over this is idiotic.

The real problem is the involvement of government, and this kind of absurdity regarding ToS, EULAs and so on has been going on for decades. If you have the money, you can make the government your personal watchdog.


Whether or not it technically qualifies as a "double standard," in practice I don't see anything inherently unfair about it.

If a stranger enters my house without my permission, that's trespassing. But there's nothing unfair about letting in someone who I invite over.


That's a terrible analogy. Your home is private, websites are not. The fact is that websites are posted online for all to see, so it's more like saying certain people at a park may take pictures while others are not allowed. That's unfair. If everyone could take pictures, it would be fair. Yes, someone with an old bright bulb camera might be annoying people, but nobody said "fair" meant all players would be nice or that having a "fair" policy would somehow be more beneficial to the website owner. It's not, that's why site owners are selective. So they have a double standard, but it's for their benefit, not that of the site visitors (be they human or bot).


How about the analogy of an art gallery disallowing photography? Is the gallery being hypocritical when they allow the local paper to take photos for publicity, or when they permit an archivist that has a known reputation to take photos for archival purposes?


You can still deal with the old bright bulb cameras: you can have rules which apply to everyone. So you can make a rule at the park that pictures are allowed, but only without flash, or that only digital cameras are allowed, or only digital cameras with the fake-shutter noises turned off, etc. As long as the rule applies to everyone equally, it's fair, even if you think the rule is silly.

For websites, it's not fair to have different rules for Google than others. What would be fair is some kind of rule about how often visitors can visit, how much they're allowed to download, etc.

Personally, though, I think all this is total BS. Sites are open to the public, but they also serve the whims of their owners. If the site wants to prevent access to people from a certain IP range, that should be their right. If they don't want any scrapers, that should be their right too, or if they want to allow Google and not anyone else, that should also be their right. What isn't right is that they can use the government to enforce these arbitrary rules. If they want to block my scraper, that's fine, if they can do it on their end technologically. If they want to block my IP, they can do that too. But suing me or having the cops come to my door because they're too incompetent or lazy to do these things technologically is unacceptable. The role of government is not to enforce arbitrary policies made up by business owners.


Using the law to block crawlers is more like saying:

1. Google can come in

2. Other Americans can't come in

3. Chinese people can come in (or anywhere else where US laws don't apply)

It might not be unfair, but it is certainly pointless and arbitrary.


To be fair, many companies which take anti-scraping seriously will also take inputs like geographic origin of a request into consideration when applying request throttling and filtering.


Google is basically algorithms built on top of a scraping service. It's unfair to competitors (and potential disruptors) to restrict access to data that Google can fetch without limits.


Maybe we (scrapers) just need to market ourselves as search engines. _Indexing_ is what we're doing. :)


Exactly, market scrapers as search engines.

And all smart websites should include a ToS that says you are not allowed to access their data, so they can selectively sue anyone they don't like for trespassing.

The far reach of government into this, and also the pirating stuff (which I do not condone, but arresting people for it is way too much), is what makes me want the system to collapse under its own weight. Like some website suing members of Congress for visiting it while violating the ToS in this case.

I also secretly wanted Oracle to win vs Google, so that cloning an API would be piracy; that would extend to making it a crime to purchase pirated goods, which would make all clean-room reverse engineering a criminal activity. That would lead, in theory, to anyone using a PC without an authentic IBM BIOS (look up Phoenix BIOS) being arrested, so even the US president would have to fall into that. It would have been a glorious shitstorm if Oracle had won and IBM had taken that precedent to its logical conclusion; the computer world would have failed, and the law would either be made even more arbitrary or be fixed, but at least it would have shown how idiotic the state of affairs was.


Your idea about Oracle winning and society coming crashing to a halt is ridiculous and wouldn't have happened. Your flaw is believing that the law and the government will work with logical precision, so that a flaw in the law will, like an infinite recursive loop in programming code, cause complete disaster. It doesn't work that way. There's plenty of cases where the law is clearly broken (see civil forfeiture vs. the 4th Amendment to the US Constitution), yet nothing is done. That's because the government is run by humans, and they'll enforce things the way they want. Double standards happen all the time with law, and it takes big, expensive court cases to sort them out, and of course that only happens when some moneyed interest wants to fix it (which is why civil forfeiture is still a big thing--they're not going after extremely wealthy people or corporations with it). While IBM is certainly large enough to bring a big case like you suggested, the US government is far bigger and can simply invent a legal way of ignoring them, just as was done when the SCOTUS decided to rule in favor of using Eminent Domain to seize private property to hand over to commercial interests.


Because I may also come to the point where I am a direct competitor to Google, but I will never get there because I can't scrape any site like they can.

Your next argument may very well be a racist one, with the very same excuse you used above.


And if you have some way to identify yourself as a potential competitor to Google, rather than some jackass trying to scrape email addresses or spam comment forms, I'm all ears.


The majority of the websites that blekko, a Google competitor, contacted to ask for robots.txt access ignored us.


I agree, it's a difficult conundrum. It sucks.


There are worse barriers to entry for a search engine! DMCA take-downs... Right to be forgotten... click history...


Worse is that Google tries to stop scraping. It's like they don't want anyone to see past the first page of results.

They can scrape your website and then prevent you from scraping your own data back.

The whole process is silly; it reflects the duct tape and chicken wire nature of the www.

No one should have to "scrape" or "crawl".

Data should be put into an open, universal format (no tags) and submitted when necessary (rsynced) to a public-access archive, mirrored around the world.

This would bridge the gap until we reach a more content-addressable system (cf. location-based).

Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.

"Crawling" should not be necessary.

No one should have to store HTML tags and other window dressing for data.

Dream on.


That's the antithesis of the world wide web because you've just centralised data storage, which makes someone 'own' the www.


I do not understand your argument.

To give an example, there is a lot of free open source software mirrored all over the internet, mostly on ftp servers, but also on http, rsync, etc.

If you use Linux or BSD you probably are using some of this software. If you use the www, then you are probably accessing computers that use this software. If you drive a new Mercedes you are probably using some of this software. There are a lot of copies of this code in a lot of places.

Is that centralized? Does anyone hosting a mirror ("repository") "own" the software? Is it the same person or entity hosting every mirror?

Compare Google's copies of everyone else's data, also replicated in a lot of places around the world. Who "owns" this data?


Double standard? The difference is that Googlebot is built to be unobtrusive. I can easily build a scraper that will quickly DDoS a site. LinkedIn, for example: if they allow 10,000 people to send 100 scraping requests per second every day, then that is stolen bandwidth that LinkedIn has to pay for, and the scrapers get free data. The difference is that Google has standards from which sites usually benefit, not to mention that they allow you to disallow their bot. It just doesn't work the same way with some random developer building a scraper.


I agree that Googlebot is well behaved. When it detects your site is slowing down, it will back itself off. Unfortunately, this is often to your detriment.

In my experience, on a large site, Google will often slurp as much as you let it, upwards of hundreds of pages per second.


Google usually does 300 pages a minute on my site. In total, bots were loading about 1,000 pages a minute.


That is a lot! It's also still an order of magnitude less than big content sites. Not taking anything away from what must be a successful website to get a consistent 300 pages/minute crawl rate, but only to illustrate magnitude.


I was curious, so I just checked the stats through webmaster tools. For the last 90 days, the low is 450,000 daily crawled pages, average is 650,000, and yesterday was the high of 1,130,000 (780 per minute). Ouch.


Have you seen correlation between rankings and crawl rate?


This particular site is top 5,000 Alexa. The content changes every minute, and Google is fast at picking up those changes. The last cache of the homepage was 7 minutes ago from Google.

There's definitely a correlation between my sites' Google rankings, their organic traffic, and their crawl rate. The other sites I run are Alexa top 30,000 and top 100,000. They all feature dynamically changing content, but Google is definitely using a higher crawl rate on my higher ranking sites. This isn't a surprise though, Google has limited resources like everyone, and they'll focus those resources in a way that provides the most benefit.

Edit: If you're talking about the correlation between daily ranking and daily crawl rate for an individual site, then no, I'm not aware of any patterns. For example, the graph is flat for organic traffic and total indexed pages, but the crawl rate jumps up and down as mentioned, and it doesn't appear to relate on a daily basis.


I've seen rankings drops following drops in crawl rate.


Google and others with legitimate reasons obey robots.txt.


This post is kind of crazy, aggrandizing bad behavior and misuse of others' resources against their will.

Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.


>Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.

Not really.

Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.

If you are slamming the site with requests because of your scraping, yeah you need to knock it off. If you throttle your scraper in proportion to the size of their site, you aren't really harming them.

In regards to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.

tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.


> If you throttle your scraper in proportion to the size of their site, you aren't really harming them.

And do you understand their site infrastructure to know whether you're doing harm? It's perfectly possible that your script somehow bypasses safeguards they had in place to deal with heavy usage, and now their database is locking unnecessarily.


Eh, this is pretty weak. Scrapers are no different from other browsing devices. The web speaks HTTP. There's no reason that using another HTTP browser would cause any disparate impact just by virtue of not being a conventional desktop browser -- you've thrown out a pretty absurd hypothetical. In fact, scrapers usually cause less impact because they usually don't download images or execute JavaScript.

I did an analysis and a session browsed with my specialized browser would always consume less than 100K of bandwidth (and often far less), whereas a session browsed with a conventional desktop browser would consume at least 1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition, on the desktop, a JavaScript heartbeat was sent back every few seconds, so all of that data was conserved too.

Because we were a specialized browser used by people looking for a very specific piece of data, we could employ caching mechanisms that meant that each person could get their request fulfilled without having to hit the data source's servers. We also had a regular pacing algorithm that meant our users were contacting the site way less than they would've been if they were using a conventional desktop browser.
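Not our actual implementation, obviously, but the general shape of a shared cache plus pacing toward the source is simple enough to sketch in Python (the constants here are arbitrary):

    import time

    CACHE_TTL = 300        # seconds a cached answer stays fresh (arbitrary)
    MIN_GAP = 10           # minimum spacing between upstream requests (arbitrary)

    _cache = {}            # query -> (timestamp, result)
    _last_upstream = 0.0

    def lookup(query, fetch_upstream):
        """Serve from cache when possible; otherwise pace requests to the source."""
        global _last_upstream
        now = time.monotonic()
        hit = _cache.get(query)
        if hit and now - hit[0] < CACHE_TTL:
            return hit[1]                      # many users served by one upstream request
        wait = MIN_GAP - (now - _last_upstream)
        if wait > 0:
            time.sleep(wait)                   # regular pacing toward the data source
        result = fetch_upstream(query)
        _last_upstream = time.monotonic()
        _cache[query] = (_last_upstream, result)
        return result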

Our service saved the data source a large amount of resource cost. When we were shut down, their site struggled for about two weeks to return to stability. I think they had anticipated the opposite effect.

Our service also saved our users a large amount of time. We were accessing publicly-available factual data that was not copyrightable (but only available from this one source's site). There's no reason that the user should be able to choose between Firefox and Chrome but not a task-specialized browser.

It is true that some people will (usually accidentally) cause a DDoS with scrapers because the target site is not properly configured, but the same thing could be done with desktop browsers. It doesn't mean that scrapers should be disadvantaged.


A small counterpoint to this -- in the airline industry, it's relatively commonplace for seat reservations to be made for a user _before_ payment has occurred. In this case, if you're mirroring normal browser activity, you can (temporarily) reduce availability on a flight, potentially even bumping up the price for other, legitimate users, and almost certainly causing the airline to incur costs beyond normal bandwidth and server costs. I'm sure there are many other domains for which this is also the case, however rare.


If they don't do the seat reservation behind a POST, or at least blacklist the reservation page in robots.txt, I have no sympathy.


I've had this happen regularly enough that I developed the habit of finding a fare in the morning and then returning at 11pm ready to buy.


Cinemas and some online shops do the same. I've always wondered if it's possible to block any tickets sales for entire flights/screenings/products this way.

And if airline tickets are based on supply v demand, it might even be possible to drive down ticket prices by suddenly dropping a load of blocks near to the flight date.


If you have ever tried to buy hot tickets online and not been able to get any, this is due to bots. Bots are the scalpers friend.

This can easily be prevented by requiring ID matching the ticket on entry, but the ticket sellers often don't seem to care.


> you've thrown out a pretty absurd hypothetical

Not even remotely absurd. Where is the data your scraper consumes coming from? It's almost always served from some sort of data repository (SQL or otherwise). That data costs far more per MB to serve up quickly than JS/CSS/images.

Suppose, for example, you host a blogging platform that has one very popular user. Most accounts on your site don't get a ton of visitors, and that one very popular user's posts are all stored in cache.

Then along comes a scraper. He thinks, "Hey, this site is serving up a million page impressions a day. It can definitely handle me scraping the site".

But when he runs the scraper, he fills up the cache with a ton of data that it doesn't need, causing cache evictions and general performance degradation for everyone else.


There are already 6-8 major scrapers that do this constantly, across the whole internet, called search engines. You can't handle that?

What if you get a normal user who says "Hey, I wanna see some of the lesser known authors on this platform" and opens up a hundred tabs with rarely-read blogs? What if you get 10 users who decide to do that on the same day? Is it reasonable to sue them? Should there be a legal protection to punish them for making your site slow?

Don't blame the user for your scaling issues. If the optimized browser ("scraper") isn't hammering your site at a massively unnatural interval, it's clean. And if it is, you should have server-side controls that prevent one client from asking for too much data.

These are just normal problems that are part of being on the web. It's not fair to pin it on non-malicious users, even if they're not using a conventional desktop browser.
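To be concrete, the server-side control I mean can be as small as a per-client sliding-window counter. A sketch (in-memory, single-process, with made-up limits, purely illustrative):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS = 120     # made-up per-client budget per window

    _history = defaultdict(deque)   # client id (e.g. IP address) -> recent request times

    def allow_request(client_id):
        """Return True if this client is still within its budget; otherwise reject (HTTP 429)."""
        now = time.monotonic()
        window = _history[client_id]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()            # drop timestamps that fell out of the window
        if len(window) >= MAX_REQUESTS:
            return False
        window.append(now)
        return True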


Search engines respect robots.txt – not sure many scrapers do.


First, search engines are scrapers. No need to make a distinction.

Second, search engines don't always respect robots.txt. They sometimes do. Even Google itself says it may still contact a page that has disallowed it. [0]

Third, robots.txt is just a convention. There's no reason to assume it has any binding authority. Users should be able to access public HTTP resources with any non-disruptive HTTP client, regardless of the end server's opinion.

[0] "You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file." / http://archive.is/A5zh8


In the Google quote you link to, Google is not contacting your page. Rather, Google will index pages that are only linked to, which it has never crawled, and will serve up those pages if the link text matches your query. That's how you get those search results where the snippet is "A description of this page has been blocked by robots.txt" or similar.

There's a somewhat related issue where, to ensure your site never appears in Google, you actually need to allow it to be crawled, because the standard for that is a <meta name="robots" content="noindex"> tag, and in order to see the meta noindex, the search engine has to fetch the page.


And the original point of my comment was that doing this is extremely rude and not appropriate, not that it couldn't be done or that others weren't doing it.

Feel free to send any request to any server you want; it is certainly up to them to decide whether or not to serve it, but that doesn't absolve you of guilt for scraping someone's site when they explicitly ask you not to.


Please don't conflate "extremely rude", "not appropriate", and "guilt". Two of these are subjective opinions about what constitutes good citizenship. The last one is a legal determination that has the potential to deprive an individual of both his money and liberty. We're discussing whether these behaviors should be legal, not whether they are necessarily polite.


I never did.

You are posting in a comment thread underneath my reply about rudeness and impoliteness, ironically being somewhat rude telling me off about what not to conflate when it was never what I said.


Google will put forbidden pages in its index. It doesn't scrape them. (The URL to the page exists even without visiting the page.)


We do, and we also use our own user-agent string: "SiteTruth.com site rating system". A growing number of sites reject connections based on USER-AGENT string. Try "redfin.com", for example. (We list those as "blocked"). Some sites won't let us read the "robots.txt" file. In some cases, the site's USER-AGENT test forbids things the "robots.txt" allows.

Another issue is finding the site's preferred home page. We look at "example.com" and "www.example.com", both with HTTP and HTTPS, trying to find the entry point. This just looks for redirects; it doesn't even read the content. Some sites have redirects from one of those four options to another one. In some cases, the less favored entry point has a "disallow all" robots.txt file. In some cases, the robots.txt file itself is redirected. This is like having doors with various combinations of "Keep Out" and "Please use other door" signs. In that phase, we ignore "robots.txt" but don't read any content beyond the HTTP header.
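Roughly, that probing phase looks like this (a simplified sketch using the requests library and a placeholder domain, not our production code):

    import requests

    def find_entry_point(domain):
        candidates = [
            f"{scheme}://{host}/"
            for scheme in ("https", "http")
            for host in (domain, "www." + domain)
        ]
        for url in candidates:
            try:
                # HEAD with redirects disabled: we look only at the HTTP header,
                # never at the page content.
                resp = requests.head(url, allow_redirects=False, timeout=10)
            except requests.RequestException:
                continue
            if resp.status_code in (301, 302, 303, 307, 308):
                print(url, "->", resp.headers.get("Location"))
            elif resp.status_code == 200:
                return url      # this variant answers directly; treat it as the entry point
        return None

    print(find_entry_point("example.com"))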

Some sites treat the four reads to find the home page as a denial of service attack and refuse connections for about a minute.

Then there's Wix. Wix sometimes serves a completely different page if it thinks you're a bot.


> I did an analysis and a session browsed with my specialized browser would always consume less than 100K of bandwidth (and often far less), whereas a session browsed with a conventional desktop browser would consume at least 1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition, on the desktop, a JavaScript heartbeat was sent back every few seconds, so all of that data was saved too.

Bandwidth is certainly part of it, but there's also database and app-server load (which may be the actual bottleneck) that a scraper isn't necessarily bypassing.


Yeah, I just have a hard time buying that a scraper that does less than a conventional desktop browser is going to accidentally stumble across something that causes the server-side to flip out. I'm not really sure in what case your hypothetical is plausible.

Scrapers are usually used to get publicly-available data more efficiently. What you're describing would basically require the scraper to hammer an invisible endpoint somewhere, but there's no reason the scraper would do that -- it just wants to get the data displayed by the site in a more efficient manner. I suppose the browser could enforce a cooldown on an expensive callback via JavaScript, which a scraper would circumvent, but IMO that's not a fair reason to say scrapers are disallowed; cooldowns should be enforced server-side. There's no way to ensure that a user is going to execute your script. That's just part of the deal.

Everything about scrapers means less server load; no images, no wandering around the site trying to find the right place, no heavy JavaScript callbacks that invoke server-side application load, etc. Scrapers are just highly-optimized browsing devices targeting specific pieces of data; it's logical that they would be cheaper to serve than a desktop user who's concerned about aesthetics and the like.

In our specific case, those JavaScripts we didn't download included instructions to make over 100 AJAX requests on every page load. No wonder users were looking for something more efficient.

So I agree that a scraper isn't necessarily bypassing some load-heavy operations, but I find it highly implausible that a non-malicious scraper would be invoking operations that cause extra load (beyond just hitting the site too often). Frankly, I'd be surprised if there was a functional scraper that regularly invoked more resource cost per-session than a typical desktop browsing session to get equivalent data.


> What you're describing would basically require the scraper to hammer an invisible endpoint somewhere

That wasn't my point. My point was: a lot of a website's costs are hidden from a web scraper (e.g. database load), so a scraper can't claim, based on the variables they can observe (bandwidth), that they're costing the website less than normal traffic.

I was basically responding to statements like this:

> In fact, scrapers usually cause less impact because they usually don't download images or execute JavaScript.

There's really no way for a scraper to know that unless the website tells them. Their usage pattern is different than typical users and raw bandwidth (for stuff like static images) may not matter to the website.


It's true that there's no way to know that for sure, but it doesn't make sense that a scraper, by virtue of its being a scraper, is incurring additional load. A scraper is only making requests that a person with a desktop browser or any other appliance that speaks HTTP could make. What's the difference between a user clicking the same button on the page 50 times or holding down F5 and a scraper that pings a page once a minute?

Your argument is basically boiling down to "scrapers could hit one load-heavy endpoint too fast", but so could desktop browsers. So I don't see what it has to do with scraping.


> but it doesn't make sense that a scraper, by virtue of its being a scraper, is incurring additional load

It does, because scrapers don't have normal usage patterns. They're robots and behave like robots.

> What's the difference between a user clicking the same button on the page 50 times or holding down F5 and a scraper that pings a page once a minute?

Typical users aren't usually in the habit of mashing F5, especially not for robotically long periods of time. It's basically the difference between a theoretical activity and an actual activity.

Basically, scraping is not regular usage, and I don't think it's correct to pretend that they're equivalent (or more extremely, that scraping is less costly to the website).


Scrapers are usually coded to have as regular of a usage pattern as possible, so that the data they retrieve is as much like the data the end user would receive as possible.

For example, Googlebot does everything in its power to ensure that it sees pages the same way that end users sees them, executing JavaScript and performing OCR to try to read information conveyed in images. Google also has non-Googlebot scans to try to determine if a page is serving different content to Googlebot-labeled scans, and they penalize sites that they suspect of doing this.

While it is true that someone could write a scraper that obviously behaved robotically, it is also true that someone could use their desktop browser in a robotic way. Mashing F5 is so common that there are many ancient memes referring to and making jokes about that activity. There are extensions that end users use to record browser macros, behaviors they want their browser to repeat over and over again.

However, this conversation about whether scrapers behave robotically or not is moot because a web site shouldn't break down under load when someone uses it in a slightly-irregular way. The obvious, crappy scrapers are trivial to block. The ones that blend into the traffic are no harm, no foul. If you can't tell the difference between an optimized browser like a scraper and a general-purpose browser like Chrome, why shouldn't it be allowed to talk to your site?


> Typical users aren't usually in the habit of mashing F5, especially not for robotically long periods of time. It's basically the difference between a theoretical activity and an actual activity.

Just like every university site ever is completely down during signup days because everyone is mashing F5.

Link me your site, I’ll treat it like a college student waiting to be able to sign up for their classes.


Have run into exactly this before. Wrote a scraper that retrieved results from a trivia league website. Tried to be a polite scraper (<1 request per second) but the site still crashed - even with 5 seconds of sleep between requests. They were doing something weird with DB connection management (maybe just forgetting to close it and letting it timeout? I remember figuring it out but it's been quite a while) and so after N very reasonably spaced queries the site would reproducibly start throwing an uncaught MAX_DB_CONNECTIONS_EXCEEDED and just be down for everybody everywhere who might've wanted to use it.


It seems like you could easily hit those scaling issues by manually browsing the website. While I agree that it sucks to take down a site by scraping, in that specific case it sounds like the performance issues are their fault and not yours. That said, once I realized the effect my scraping had, I would (hopefully) cease my scraping.


So the thing is, I could totally believe they never saw this traffic pattern under normal load. I'd expect bar trivia scores in a certain mid-sized US city are one of those niche things where you have a very low number of uniques but each unique then pokes around on 9 or 10 pages while they're there. The fact that the site didn't crash during normal browsing was what originally led me to speculate they were maintaining an open DB connection per session. If that was indeed the issue, I could totally imagine they'd only rarely (never?) had 100+ "concurrent..ish" unique visitors.


Ok, then why couldn't you revise your scraper so that it did everything in a single session, to avoid this problem?

To me, for private, personal use, a scraper should emulate a normal human browser as much as possible to avoid causing site problems and to avoid detection. If what you're doing can be done in the background, or by a cron process at some odd hour, it doesn't have to be fast at all, and you can set the timings to be similar to a normal human.
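Something like this is usually enough: a sketch using the requests library, where the URLs and timings are made up and would depend on the site:

    import random
    import time
    import requests

    PAGES = [
        "https://example.com/league/standings",   # hypothetical pages
        "https://example.com/league/week/1",
    ]

    headers = {"User-Agent": "Mozilla/5.0 (personal automation)"}   # placeholder UA

    for url in PAGES:
        resp = requests.get(url, headers=headers, timeout=30)
        # ... parse resp.text here ...
        time.sleep(random.uniform(5, 20))   # irregular, human-ish pause between pages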


Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.

I'd consider that a bug not a feature but I still think it's incumbent on me, the guy scraping the website, not to trigger it.


That is a classic connection pooling/lifecycle bug, and usually one that gets caught in the first few days of having multiple people utilizing a product/service, worst case.

If someone's production site, that's been around for a while, had a bug like this that can be triggered by what you describe, I'd love to see how many real users they have. I'm sure it's possible under certain circumstances, but it's definitely bad engineering that would be exposed by literally any traffic.


You can avoid triggering this in your scraper by activating a cookie jar. Pretty simple most of the time. Even commandline cURL and wget support it. I'm sure you figured that out already, but just for anyone who's wondering. ;)
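In Python the same thing is just a session object; a tiny sketch with made-up URLs (the cURL equivalent is -c cookies.txt -b cookies.txt):

    import requests

    # A Session keeps cookies across requests, so the site sees one session
    # instead of a brand-new visitor on every page (the pattern that triggered
    # the connection bug described above).
    session = requests.Session()
    session.get("https://example.com/scores", timeout=30)          # first request sets the cookie
    session.get("https://example.com/scores?week=2", timeout=30)   # later requests reuse it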

That said, while obviously you want to avoid triggering the bug since it offlines your data source, this is definitely in the site's court to fix and could easily be triggered by normal usage. Some people browse with cookies disabled, especially since the EU passed its "cookie law", requiring sites to get consent before storing a cookie on visitors' machines. If you've started to notice more sites talking about cookies over the last year, that's why. [0]

[0] http://ec.europa.eu/ipg/basics/legal/cookies/index_en.htm


>Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.

Could also be something like storing Hibernate's second-level cache in the session. Unfortunately I've seen this; a significant chunk of the database was being copied into each user's session.


I mostly agree with the post's author on the "I'm just automating something I'd otherwise be doing manually" point. If the local weather service publishes, say, barometric charts on their site, but has a TOS that prohibits me from scraping, and my alternative was to hit their site every day and right-click and save-as on the chart, I feel absolutely no compunction in automating that. You need to be careful of the slippery slope though: once it's easy to grab every day's local barometric chart, it becomes too easy to think "Hey, I just need to stick that in a loop and I can grab 1000 different charts every day!" I'd personally _not_ do that. If it's something I'm likely to do "by hand" but would occasionally miss a day or three, I'll automate it no matter what the TOS says.
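The automation itself is trivial; a rough sketch of the daily-chart case (URL and filename are made up), meant to be run from cron or a scheduled task rather than a loop:

    import datetime
    import urllib.request

    CHART_URL = "https://example.org/weather/barometric-chart.png"   # hypothetical chart
    today = datetime.date.today().isoformat()

    with urllib.request.urlopen(CHART_URL) as resp:
        data = resp.read()

    with open(f"barometric-{today}.png", "wb") as f:
        f.write(data)

    # crontab entry, once a day at 07:15:
    #   15 7 * * * /usr/bin/python3 fetch_chart.py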


You're saying the site should pay him a consulting fee for the free load-testing service he provides?


One fair baseline is whether or not the custom User Agent you're using to scrape has request timing that's on the order of what a fast human visitor might do. If the site can't handle that, it's certainly not the UA's fault.


> Scraping, in my opinion, isn't black hat unless you are ... stealing info.

And as a webmaster, how can I tell the difference before it's too late?


> tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.

Analyzing data that you're not allowed to access gives you/your company a competitive advantage, which affects their service/business even if it's not posted or distributed publicly.


I don't follow your argument. How does one get their scraper access to data they would otherwise not be able to access through 'normal' browsing techniques?


Example I know of: you can scrape your competitors' Facebook pages since their creation and output nice graphs of which posts generated what kind of likes and subscriptions. This data is usually limited to the owner of the page.


But I can get the same data with manual browsing. And I could just as well pay some workers in China to collect the data and input it into my database.

Automated scraping is just a way to drastically reduce the labor cost of information collection. Sure, it's a competitive advantage, but I think disallowing it or calling it unethical is a pretty big can of worms. Why is it OK if something is done by humans but not OK if a computer does it by itself?


If someone was to sit down and use paper + pencil + time to accomplish the same thing, would you still have issue with it? It's publicly available data. Should you also not watch your competitors television ads or walk in to their physical store and browse around?


The difference being that I can scrape the data of ALL my competitors since 2008, get the number of likes and comments of every single comment and automatically generate graphs of the data. Thousands of man hours in one minute.

This gives an unfair advantage to the tech-savvy "hackers". Facebook's terms protect against this; thus scraping it is disallowed.

I couldn't say if it would be moral or immoral to do this. Personally, I'm more concerned about the well being of poor scraper program that has to scrape through an entire decade of Facebook posts. Poor thing.


This is called hustling. I love it. What's not to like?


By ignoring robots.txt and bypassing captchas.


'Bad netizen stuff'? Is this a comment from 1997? Breaking captchas is 'blackhat'? What cozy hippy internet alternative reality does this come from?

These are the same websites and companies that are loading evercookies and doing browser fingerprinting, that break as much as possible the anonymity citizens should enjoy, with Real Name policies, using network analysis to find out who your friends are and what your politics and buying habits are, that routinely rip private information from your cell phone and share it with oppressive regimes.

You're not in Kansas anymore Toto.


> misuse of others' resources against their will

Nonsense, there is no implication that this activity is illicit. Many sites (I have worked with hundreds) are happy to be included in my service, but don't have the technical ability to provide a data feed. They were delighted when I told them I could aggregate their content without any extra work on their part.

We respect TOS, we respect robots.txt and so on. Just because you study scraping techniques doesn't mean you intend to break the law.

> Breaking captchas and the like is basically blackhat work

Um, captchas only work if they work. If breaking them is trivial, they shouldn't exist. Don't shoot the messenger for pointing out the front door is unlocked.


"Don't shoot the messenger for pointing out the lock on your door can be picked"


"Don't shoot the messenger for pointing out that robots are capable of opening doors."


Sometimes, scraping a website is the only way at your disposal to fetch relevant information you've paid for (or not). It could be simply the opening schedule of your local administration or the status of different pieces of public infrastructure.

If your administration doesn't have the resources (and it's often the case) to maintain a proper JSON API for you to fetch with a fancy Python lib, then it's not "super bad netizen stuff" to scrape a few HTML/PDF/XLS files, parse them and display them for convenient public consumption on your personal website (while paying for the bandwidth).

It's 2016. State-companies holding a third party responsible for their own outages and poor planning is _bad faith_[1]. ETL? Never heard of it?

[1]: https://citymapper.com/i/1208/soutenez-citymapper-et-lopen-d... (french)


To know the TOS of a page, you need to read it. To know which links are part of a site and which are not, you need to follow the link. Having a TOS as part of the page content is akin to having a sign in a room, only readable by entering the room, that says "you are not allowed to enter this room".

Yes, this defense is being petty about details, but I find businesses using post-hoc discoverable limitations to limit people's rights annoying.


Instagram and Facebook thrive on stolen or relinked content and monetize it day in, day out.

Being amazed at this kind of bad behaviour where the targets are some of the most despicable companies on the web is a bit ironic. Scrape away, these companies hurt the web, let's hurt them (even though, all the scraping in the world won't have any impact).


So it's moral to continue bad behavior because someone else did it?


It's not bad behavior. The companies that profit off this try to make you think it's bad behavior because they don't want to risk your taking any profit away, and they've installed laws that let them get away with this. They can violate to their heart's content, but unless someone else in the oligopoly sues over the matter (which they would never do, because the precedent may prevent their abuse of the law), the peons will be forced to comply. That's not how a competitive marketplace works, and it's why we have such a hard time breaking gridlock on web properties.


> This post is kind of crazy, aggrandizing bad behavior and misuse of others' resources against their will.

How so? I send a web request, they send me the content in a response. If they aren't happy with that then they should refuse my request.


I disagree. DoSing a site is bad behaviour, regardless of how you do it. But accessing it in an automated way instead of with a browser? Not really. The deal on the Internet is like this: a website owner can provide whatever they want, and a visitor can read it however they want. Discriminating against visitors based on whether or not they seem to be bots instead of people goes beyond what a site provider should do. So does detecting and blocking people using adblockers.


I agree with your assertion about "the deal on the Internet", but I disagree with your conclusions. IMO, site owners should be able to discriminate all they want. However, when someone's browser (or other software...) makes a request to that site, and the site serves them some data, the user should be able to do with that what they will: either honor or decline the requests for them to download various ads or JavaScript, for instance. It should be up to the site owner to craft their site to follow their policies and whims. What I'm completely against is the idea of using the government and law enforcement to enforce some site owner's policies. The only exception I can see for this is extreme cases where this general principle falls down: DOS attacks, for instance.

If I can modify my web browser to view a site, but skip the ads, that should be my right. If the site owner codes their site to detect this and then blocks my request to see their site, that should be their right. If I modify my ad-blocker to get around their ad-blocker-block, that should be my right, and so on. As long as we don't get into something like DDOS territory where a reasonable web site has no good technological way of avoiding the problem caused by a user, this isn't something for government to get involved in.


Hmm... yeah, I guess what you (and the other commenter) describe is fairer than what I wrote. Thanks!


IMO all's fair as long as the solution is technical. Dragging it into the courts because you can't figure out how to stop them technically (especially if they're not actually disrupting anything) is inappropriate.

We need updated legislation that covers malicious actors that issue DDoS attacks but leaves normal people that scrape consciously and carefully alone.


So you think making people waste their time solving captchas is the solution? People are paid to solve captchas; there's always something like that, and then the users suffer more. It's not a solution at all.


How can a ToS have legal power in the case of scraping? A website is public property. If I'm visiting it without logging in, I don't have a chance to accept the ToS.

Imagine a hotel that makes guests sign a document saying they will not take photographs of the building. If I'm not a guest, I can take photographs of it, and I can't even know that would be illegal.


The UK has a database law:

https://en.wikibooks.org/wiki/UK_Database_Law#Database_Right

If you scrape, and effectively reconstitute a database, then so long as the database originally had a "substantial investment" in its "obtaining, verifying or presenting the contents", then yup... you have breached the database right, which is a modified form of copyright.

You may access said database (via the web), but as soon as you start reconstituting the database from scraping... you're in breach.

It's the law; it is illegal in the UK, and I'm sure most countries have some equivalent law on their books (all of the EU does). The law looks recent, but UK copyright and patent law used to cover it; the 1997 date is just a separate statute clarifying the position.


Actually, such database laws are rare. The US and Canada don't have one. See Feist v. Rural Telephone for an example of databases getting scraped & the scraper winning in court.


Actually, the US does have one.

The World Copyright Treaty of the WIPO, which the US also signed, enforces in Article 5 that every member country has to have a database law of this kind.

    ________________________
> Article 5: Compilations of Data (Databases)

> Compilations of data or other material, in any form, which by reason of the selection or arrangement of their contents constitute intellectual creations, are protected as such. This protection does not extend to the data or the material itself and is without prejudice to any copyright subsisting in the data or material contained in the compilation.

http://www.wipo.int/wipolex/en/treaties/text.jsp?file_id=295...

> United States of America

> Signature: April 12, 1997

> Ratification: September 14, 1999

> In Force: March 6, 2002

    ________________________
In fact, this fucking treaty is the only reason so many countries even have that at all – the EU didn’t have any Database Law before it was created, and the US threatened (as always) to boycott any country not signing.


To be clear, this wasn't a scraper in the networked computer sense. It's actually a perfect example of how meatspace safeguards don't translate because law is not equipped to handle the nature of cyberspace.


I don't see how that's true at all. Running a meatspace telephone book through a sheet-fed scanner and OCR isn't wildly different from scraping a website.


It's different because you don't contact another party's server to do it. The CFAA makes it illegal to "exceed authorized access" to networked computers. "Authorized access" is whatever the server's owner says it is. That's why the copyright status of factual accumulations isn't a protection for internet scraping.

If Feist v. Rural occurred now and Rural, like most companies, kept their information in a database online, Feist would lose not for copyright infringement, but for exceeding authorized access to Rural's server.


You've made a large number of authoritative-sounding comments on this story... and this one, like many of the others, is a guess.


To the extent that I can't tell what would actually happen in an alternate future where Feist v. Rural occurred in the digital realm, sure. To the extent that the CFAA allows companies to make that type of determination today in the actual timeline, no, it's not a guess (and I have the wrecked business to prove it).


Exactly. You can't copyright facts.


You can't copyright facts in the US. You effectively can in the EU, as the grandparent discussed, as long as you demonstrate that it took significant investment to arrange the compendium of facts from which they were drawn.


What is the definition of "reconstituting a database"? Aren't Google's indexes doing that?


Yes. Google is violating practically every law of this type. They're allowed to do it because they have a lot of money.


You may still record the responses you receive from such a database and use it for your own purposes. The database law only restricts making a duplicate database available to the public.


What happens if you use that data to create an entirely new database? Say, can I create a database of people who work at Google and like ice cream by scraping LinkedIn and Facebook?


> A website is public property.

This isn't even true metaphorically. It's like a shop front: there may be public access, but it is NOT public property.


"public property" may not be the correct metaphor. But neither is "shop front" correct.

Taking the store metaphor further, it would be more like you knocking on the front door of a clothing store and the store owners open the door and throw every possible piece of clothing at you, shirts, shorts, underwear, including coupons to "partner" stores, when all you wanted was a pair of pants.

Upon knocking, if the store owner hands you instructions on how to enter their store and interact with their products in a personalized shopping experience, that would be one thing. But when the clothing owner throws everything at you at once, what they flung at you is for all practical purposes public property.


>How can TOS have legal power for the case scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept TOS.

This is called "clickwrap". There is usually a notice in the footer of each page that says something like "By using this site, you agree to our Terms of Service." Typically, this kind of notice has been held enforceable. More recently, judges have been demanding that such notices be placed more prominently before they're held enforceable (e.g., somewhere above the fold), but that's it.

>Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" on which to safely stand and take photographs on the internet.

Also, photographs of structures may not be free to use. Architectural copyrights went into effect in the early 90s and have a term of either 95 or 120 years. Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.


I have a custom X-TOS header in all of my http/https requests stating that any company that owns the website my request is sent to, and that replies with data, owes me:

1. Total privacy; they will not track my activity on their website, including any logs.

2. They will send me a cashier's check for $1,000 for each byte that they send to me.

3. They will provide me with Mana Sakura's cell phone number.

I'm still waiting for checks and a phone number.
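
In case anyone wonders, attaching such a header is trivial in any HTTP client. Here's a minimal sketch in Python using the requests library; the X-TOS name, its value, and the "terms" it points to are all made up for illustration, and the server is of course free to ignore them entirely:

    import requests

    # Hypothetical "terms" header; the name, URL, and demands are invented
    # for illustration, and the server will simply ignore them.
    headers = {
        "X-TOS": "https://example.com/my-terms.html",
        "User-Agent": "tos-demo/0.1",
    }

    resp = requests.get("https://example.com/", headers=headers)
    print(resp.status_code)  # the server answers as usual, terms unread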


If you can convince a judge that this represents an enforceable contract, as has been done and established with clickwrap, then you should be able to get what you're owed. :)

It is ridiculous. Something like "pagewrap" can't trump the consumer protections that apply to a physical good like a book; it would be laughed off. But the law doesn't contemplate network access so reasonably.


> Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.

The architect can claim infringement all they want, they don't have a case. From https://www.law.cornell.edu/uscode/text/17/120 :

The copyright in an architectural work that has been constructed does not include the right to prevent the making, distributing, or public display of pictures, paintings, photographs, or other pictorial representations of the work, if the building in which the work is embodied is located in or ordinarily visible from a public place.


Thanks for pointing this out. I had run across this before, but I guess I disregarded it because, for many uses, "ordinarily visible from a public place" is vague, and one could still be forced to prove in court that the depicted building is ordinarily visible from a public place. I would like to know whether a "public place" means exclusively public property or also private property onto which the public is welcomed, and whether a pictorial depiction is infringing if the subject is rendered as it would be seen from a "private place", whatever that is, especially if the artist or photographer had never been in that private place personally.

This is an important caveat to architectural copyright, however, so thanks for clarifying.


"Public place" is indeed a vague term. In some statutes, it has a definition that includes publically accessible private property such as common areas of businesses, hospitals, etc. Does it include places where access depends on payment of a fee? How high above the ground do public places extend (thinking of photography drones)?

See also https://mentalhealthcop.wordpress.com/2013/09/20/place-to-wh... which shows the same ambiguity exists in the UK.


> The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" on which to safely stand and take photographs on the internet.

IANAL but this seems perverse. In no meaningful sense am I on corporate property when my computer in my house sends signals to another computer, formatted so that they will be re-sent in turn to a series of other computers, the last of which decides on its own based entirely on the signal it receives from the penultimate host to send a "response" to a different series of other computers, the last of which is my computer in my house.

Surely there are better ways to enforce IP restrictions than this tortured analogy of networked computing to physical location?


"Clickwrap" refers to situations where you have to click through before using the service, hence the name. Agreements which are simply a passive notice in a footer somewhere are called "browse-wrap", and are much less likely to be considered enforceable:

https://en.m.wikipedia.org/wiki/Browse_wrap


The line is blurred between clickwrap and browsewrap -- those are colloquial terms to describe ToS notices, not legal terms. Is it still browsewrap if you say "By clicking any of the links on this site, you agree to the ToS"? How far away from the clickable buttons must the statement be to be browsewrap instead of clickwrap? The distinction is really only a technicality in the wording, not anything substantive. In practice, you are still being forced to agree to a binding contract (many of which remove one's right to sue in a court of law) just by going past a landing page.

Even if we entertain a distinction between browsewrap and clickwrap, browsewrap is generally enforceable, especially after minor modifications to placement and/or font size.


That notice is typically in the footer, and a screen reader will reach the nav-bar before mentioning the TOS notice.

Even for sighted people, the notice is often easy to miss - and this is by design.


It's by design because the vast majority of people don't care about that information and it makes the website worse for them to have a big ToS banner at the top of your page.

I don't think many websites have a secret ToS that they hope you won't read, I think most of them don't even know what their own ToS say. I signed my lease on a site with an explicit checkmark for ToS that said I agreed I would only use exactly IE7 to use their site.


Which is plenty amusing, until one of those companies is suing you in a court of law for violating said ToS.


Neither of us has explicitly mentioned a jurisdiction, but assuming you, like me, are referring to the United States...

I suppose I can do no better than quote from the Wikipedia page I linked:

> The Second Circuit then noted that an essential ingredient to contract formation is the mutual manifestation of assent. The court found that "a consumer's clicking on a download button does not communicate assent to contractual terms if the offer did not make clear to the consumer that clicking on the download button would signify assent to those terms."

The same page cites a number of cases where a "browsewrap" agreement was found unenforceable and only one where one was found enforceable - and the latter, for what it's worth, involved a sale taking place through the website rather than anything resembling passive browsing. Of course there exist other cases not listed; and there are situations that muddle the distinction between clickwrap and browsewrap. But at least, the very common pattern of, as you said, burying "a notice in the footer of each page" without anything vaguely resembling active consent, as practiced by probably the majority of commercial websites on the internet, seems to pretty clearly fall on the unenforceable side of the line based on those precedents.


Heh, fwiw, there is another one, shrink-wrap agreements, where you can't read the agreement until you've removed the shrink-wrap but doing so means you've agreed.


One of the original court cases covering this was eBay vs Bidders Edge.

https://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge

The courts have generally disagreed with that interpretation.


> A website is a public property.

No, it's not. It may be in public view, but that's a different issue.


That's an interesting analogy - though you're allowed to take photographs of whatever is in public view in many jurisdictions. Now if you wanted you could take this argument to the extreme, but surely there's some parallel between sending and receiving photons across the border of someone else's property (perfectly agreeable) and sending and receiving requests?


Both the LinkedIn and OKC cases involved the scrapers using logged in accounts.


> A website is public property.

This is a gross misunderstanding of how the internet works.


That analogy is not apt. If you take photographs of a building while on the building's property, they have the right to tell you to stop, or call the police to escort you off if you refuse to do so.


Regardless of whether that would be reasonable, is it actually true? I know that the United States has specific rules for "public accommodations," which are private properties that are generally accessible to the public, like retail businesses. Property owners in this case don't have complete control over who enters their property. The obvious example is refusal of service due to membership of a protected class like race or religion.

So I'm not so sure that police will escort you out of a Walmart because they caught you taking a picture of the parking lot with your smartphone.


Let's go with a more apt analogy:

If you're entering a country, do its laws not apply to you until you've seen a copy of them? "Oh, sorry, no one told me theft is illegal here. Where does it say that? Oh, I see. Okay. I'll stop now. Thanks for letting me know."

If you cross the border without necessary documents, does that country have no right to detain you, simply because you haven't checked the laws?

Just because a website is visible and public doesn't mean its content is public domain. It just means that your first order of business as a user should be to check the terms of service. Sure, most people using a website probably don't need to--same as not needing to check a country's stance on murder--and so can just use the website as intended without violating the terms. But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.


A country's laws are a bit different, simply because a country has virtually absolute legal power over its territory. Countries can and do punish people for breaking laws that one cannot feasibly know they were breaking. Does any human know all the laws in the United States? Would that even be physically possible?


There are some interesting science fiction opportunities here. When you open a connection to a site, all traffic over that connection is subject to the jurisdiction of that site's ToS, regardless of disclosure.

Also, we don't even know how many laws there are in the United States, so I'd say knowing their content is impossible.


That is not a good analogy. There is such a thing as reasonable expectations when visiting a website, so you do not need to read the TOS. Otherwise I could put "you owe me $1000 for visiting my site" into the TOS. In other words, just clicking on a page does not constitute entering into a contract with the website. Registering and accepting the TOS does, but that still doesn't mean that anything in the TOS is enforceable.


> But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.

I don't need to check your terms of services if I'm doing something that I'm allowed to do by law anyway; the TOS cannot deny me those rights (they might, of course, grant me additional rights provided that I follow certain conditions).


Sure, but they do not have the right to retroactively declare you as having been trespassing, nor even to preemptively put up a "no photography" sign and have you arrested for trespassing if you disobey it.

The entire point of protocols is to precisely define the terms of communication. The status code is '200 OK', not '200 OK/Asterisk'. But of course if lawyers didn't force themselves into the situation, they'd be out of jobs.

As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.


>As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.

This would be copyright infringement, since the content of the page is a substantive unique work that is automatically copyrighted by its author. A site that doesn't want you scraping its content is not going to want you posting dumps of its pages. Much like BitTorrent, they'd get into the protocol and send subpoenas to the ISPs behind the IPs that serve their pages, and use that info to sue the customer.

When my company was shut down by a legal threat related to scraping, I did suggest to my lawyer that we create something like a browser extension that would grab the data we needed out of normal client-side browsing sessions. This wouldn't be as nice as controlling the flow of information ourselves but it would've worked OK. My lawyer strongly suggested avoiding that as it could've been construed as conspiratorial conduct that would've made criminal prosecution more likely.


Not to discount the validity of your experience, but the usual counterpoint to this is Google, who (as mentioned elsewhere in the thread) has been continuously scraping since the very beginning and in fact built their entire business model on doing so. They are also responsible for advancing the state of the art of scraping (albeit mostly internally), through the development of V8 and headless Chromium so that they can inspect dynamic pages too.

Perhaps this illustrates the malleability of the legal system: it's an inherently human construct that pits a plaintiff against a defendant, and given a big enough warchest and persuasive-enough arguments, catastrophe can be avoided -- by Google; perhaps not by you, me, or someone else.


Yeah, Google violates the CFAA and infringes on copyright as a matter of course. Their service would be impossible if they weren't doing so.

The main difference when Google was small was that Google was not dependent on any data source in particular, so even if someone denied their robot or sued them, they could cease and desist without affecting the overall value of their offering. This is different if you are getting data that is only available from one or two sources.

Now, the main difference is that Google is one of the biggest companies in the world, and they'll sic an army of $1,000/hr lawyers on you if you even think about taking legal action against them. The only people who can afford to fight are other big companies, but that's not going to happen because they all depend on breaking the CFAA for their own purposes and then using their position as a huge company to bully small innovators.


Google's crawling and caching has been largely found to be fair use and thus is not considered to be infringing copyrights.

https://en.wikipedia.org/wiki/Field_v._Google,_Inc.

There are similar rulings for thumbnail images:

https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com....

And of course books:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


Incidentally, this only further proves my point. If you're a big company that's retained massive law firms, you can successfully raise a fair use and implied license defense. If you're not, you can neither mount a strong offense against that defense nor mount a strong defense against Google's hypocritical offense if you find yourself on the other side.

Google's primary out here is its reputation (not guarantee) for obeying robots.txt. If Google indexed a page that disallowed it in robots.txt, the case would be much stronger. There's also the unofficial out, which is that judges think Google is a cool large company, so they rule in their favor based on their personal biases.

Fair use is a case-by-case basis, so you can't say that Google's infringing conduct is generally accepted to be fair use. The EFF had to take on Universal in Lenz v. Universal Music Group, and that went up to the Supreme Court. That's how individuals are left to assert their fair use rights.
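
(As an aside, the opt-out mechanism being discussed is machine-readable. A minimal sketch of checking it with Python's standard library, with a placeholder site and bot name:)

    from urllib import robotparser

    # Check whether a given user agent may fetch a URL per the site's
    # robots.txt; the site and bot name below are placeholders.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    print(rp.can_fetch("ExampleScraper", "https://example.com/some/page"))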


You claimed "[Google] infringes on copyright as a matter of course" despite the many real civil cases (previously cited) which have found these very activities to be non-infringing. And then, strangely you claimed:

>Fair use is a case-by-case basis, so you can't say that Google's infringing conduct is generally accepted to be fair use.

There is so much wrong with this statement. For one, how can you call something infringing at the same time you point out that nothing has been proven? That simply defies all common logic.

Secondly, in general terms, the activities in question have been found to be non-infringing by the courts. Sure fair-use is case-by-case but if you're operating within similar parameters as a previously litigated case, then the legal risk is immensely reduced.

I don't disagree with your assertion that the legal system greatly favours the well monied/connected (I don't think anyone would). But you can't claim it to be fact that Google Search is infringing anything with little to no evidence or rulings to cite. Unless you're just stating an opinion in which case you should clearly indicate that.


First, IANAL, so my use of some terms may be loose. I never intend to convey more than an informed layman's opinion. However, I do love it when I'm corrected so that my usage can improve.

Fair use is an affirmative defense. Google admits that it copies content without legal license to do so, but claims that said copies are non-infringing under fair use exemptions. I guess you're probably correct that it's no longer appropriate to refer to Google's behavior specifically as "infringing", just "copying without authorization", which, for those of us without $5 million to commit to a legal team, means "infringing". I will try to remember the special standard of law which has been allowed to Google and refer to their copying only as "unauthorized" and not "infringing" in the future.

If you review the points summarized in the Wikipedia articles you helpfully linked, you'll see that Google's defense is mostly "Yeah, but we're Google".

In Field, "the court found that the plaintiff had granted Google an implied, nonexclusive license to display the work because of Field’s failure in using meta tags to prevent his site from being cached by Google.", i.e., because Field already knew Google existed and knew there was a standard way to prevent its access but chose not to employ it, he gave Google an implied license.

Who else does that work for? Can I send an email to Netflix and tell them "Hey, if you don't want me to copy your shows, please add this in your page's HEAD element: <meta name='please-dont-download-my-shows-sir'>"? No?

I understand there are other criteria which were used to decide if Google's use was specifically infringing in addition to the implied license. Just demonstrating that Google is getting favored treatment from the judiciary that would not be available to a normal entity.

In Perfect 10 [0], the judge even explicitly indicated that he was loath to find Google's use of thumbnails infringing because he didn't want to "impede the advance of internet technology", but that he felt the law obligated him to do so (his ruling in that matter was overturned on appeal, when the Ninth Circuit found Google's usage non-infringing). What if the defendant had been some company perceived as less technically advanced than Google? This is probably as close as you can get to an explicit statement of favoritism. The Ninth Circuit also rejected Perfect 10's claim that RAM copies were infringing (which was not the case with an unlucky non-Google company discussed further down).

What if I started indexing and rehosting thumbnails? I can assure you that I would get C&D'd almost immediately and I would be forced to shut down because I can't afford to pay lawyers for 3 years while the case works through the system (and to be honest, I'm surprised it only took 3 years). And even if I could, with a reputation less sterling than Google's, there's no reason to believe that a judge would rule in favor of one useless guy instead of a big company. A judge would look at the case and say "Google's use was fair because it provided a public service [actually cited as part of the justification in most of your linked cases], but this guy is just using it for a few hundred people, it's definitely unfair, he owes that company more money than he'll make in his life, case dismissed".

There are many such cases on the books. I don't know if Google has a direct connection to the reptilian overlords or what, but it seems in most cases where they're not involved, the good side loses.

In Craigslist v. 3Taps, while primarily a CFAA case, 3Taps was found to be infringing copyrights by sampling Craigslist postings in order to allow its clients to plot them on a map. Being a "public service" or a "referential use" didn't matter for them. They were raked over the coals, and it's been that way with most cases.

In Ticketmaster v. RMG Technologies [1], RMG was found to infringe just by parsing a page. "Defendant's direct liability for copyright infringement is based on the automatically-created copies of ticketmaster.com webpages that are stored on Defendant's computer each time Defendant accesses ticketmaster.com. [...] Defendant contends [...] that such copies could not give rise to copyright liability because their creation constitutes fair use[.] [...] Defendant's fair use defense fails."

The case specifically discusses how, despite the precedent in Perfect 10, since the Defendant is not Google, it is bound by a site's Terms of Use and copyright law, and RAM copies, which are specifically non-infringing for Google, were infringing for RMG.

Very similar findings were made in Facebook v. Power Ventures, and the founder was left holding a bag of $3 million in personal liability.

This is a thread about the legality of HN users scraping. It seems Google is the only entity capable of making unauthorized copies and then getting courts to agree that it's fair use. For the rest of us, it's infringement, which carries stiff penalties (and this doesn't even broach the CFAA portion of the issue).

So when I say "infringing", I mean something that would be considered infringing if you aren't Google. It's apparently only infringement if the judges involved don't personally use your site and don't have to worry about personally suffering the consequences of not having access to it. :)

[0] https://www.eff.org/document/perfect-10-v-google-ninth-circu...

[1] https://scholar.google.com/scholar_case?case=147697505884223...


What you've failed to mention is the criteria used to determine if a usage is indeed "fair". There are 4 basic criteria [0], but they can be summarized as "If the usage doesn't affect the market for the original work, is substantially transformative, is proportionally insignificant, or is used for critique/parody, then it is fair". Or, at the risk of oversimplifying it: "Does the usage grant a net public benefit without significantly hurting the copyright holder's ability to make money?".

>Can I send an email to Netflix and tell them "Hey, if you don't want me to copy your shows, please add this in your page's HEAD element: <meta name='please-dont-download-my-shows-sir'>"?

Actually, under fair use you certainly can make a personal copy (see Betamax case). If you distribute the work you would likely run afoul of the criteria summarized above.

The relevance of robots.txt is being overstated in your argument. The main criteria used in this case are summarized above. The fact that Google provides an opt-out mechanism is a secondary, supporting argument.

>What if I started indexing and rehosting thumbnails? I can assure you that I would get C&D'd almost immediately

A determination of infringement would depend entirely on the context as related to the aforementioned criteria. The fact that someone might try to sue is a product of the terrible system in general, and you're absolutely right - as with any legal matter, the entity with the deeper pockets can often bully the other guy into submission.

>In Craigslist v. 3Taps, while primarily a CFAA case, 3Taps was found to be infringing copyrights

My understanding is that the copyright part of the case was thrown out [1] and thus was settled solely around CFAA matters.

>In Ticketmaster v. RMG Technologies , RMG was found to infringe just by parsing a page.

I agree that the logic used for the judgement is absurd (for reasons that are plainly obvious to any HN user). But it's less clear whether the case would meet the fair use criteria outlined above should it have come to that. My guess is that it wouldn't qualify, since the usage affects the copyright holder's ability to make money on the work and doesn't meet any of the other criteria for Fair Use.

>Facebook v. Power Ventures

This is not a case involving a defense of fair use (as far as I can tell). Facebook even acknowledged the users owned the data and had a right to it. The defendant was actually found to be violating CFAA and CAN-SPAM acts.

>It seems Google is the only entity capable of making unauthorized copies and then getting courts to agree that it's fair use. For the rest of us, it's infringement

Provably false [2]. It sounds like perhaps your personal experience has soured your opinion on the matter? That's understandable. But none of the evidence you've cited supports the argument that Google is infringing copyrights in its core activities nor that Google is the only entity where copyright laws and fair use legislation don't apply.

PS: To be clear, my argument revolves specifically around copyright infringement and fair use. I don't have enough understanding of other, separate legislation like CFAA to comment on that except to say that it seems overly broad and unrealistic. But that's another topic. I'm specifically arguing against calling Google a copyright infringer in a broad sense which is what you've done. That's not been proven.

[0] https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors

[1] https://techcrunch.com/2013/04/30/craigslist-3taps-lawsuit-d...

[2] http://fairuse.stanford.edu/overview/fair-use/cases/


>What you've failed to mention is the criteria used to determine if a usage is indeed "fair".

Yes, I understand that the criteria for fair use is defined in the statute. What I'm saying is that like most things brought before judges, arguments can be made either way, and judges seemingly favor Google but not smaller defendants. Thus, while the RAM copies of web pages made by Google are fair use, those made by RMG aren't.

If you look at the Ninth Circuit's ruling in Perfect 10, the lengths they stretch to in reversing the District Court's finding that thumbnails were infringing are ridiculous. It's pretty clear that thumbnails are direct infringements and that you don't invalidate the copyright or create a truly "transformative use" by making an image smaller and adding it to an index. Perfect 10 was certainly of this opinion, and I'm sure they saw a real impact on their revenue.

Over the years I've learned that no position is so high that the human factor can be disregarded. 99% of the time people are going to act primarily for their own benefit and work backwards to find rational (or rational-sounding) arguments to justify it. Judges are politicians and they're very image-conscious. None of them wants to be the one to make Google Image Search useless.

You seem to be saying that since Google's use was found non-infringing in these cases, its use is objectively non-infringing. I don't agree with this. Rather, I think that Google's conduct is a pretty plain violation of the relevant statute(s) and that most of it is not covered under fair use, the way the laws are currently written. I think that judges apply the statute in full force when smaller defendants present, but that they have a bias for Google (which is really a bias for themselves, since they know that serious backlash awaits the judge who puts the kabosh on it) that causes them to contort the law pretty heavily so that they can rule the way they want to.

>Actually, under fair use you certainly can make a personal copy (see Betamax case).

See, we were on the right track before we got into networks. Since then, the rulings have been pretty darn bad. The modern "Betamax case" may well have been American Broadcasting Cos. v. Aereo, Inc. [0], and it wasn't a win for us.

Note also that separate from the copyright concern, the DMCA makes it illegal to circumvent a copy protection device (or indeed, even to teach another how to do so). Since Netflix employs DRM, even if there is a fair-use right to a copy of a Netflix program (which is by no means certain), you'd probably have to break the DMCA to obtain it.

>The robots.txt relevancy is being over stated in your argument. The main criteria used in this case is summarized above. The fact that Google provides an opt-out mechanism is a secondary, supporting argument.

I disagree. Google has been able to discharge all CFAA claims because the judges have said "Well, you knew there was a way to stop it." If that's the logic, I'll happily inform the parties I may scrape that there's a way to stop it.

>A determination of infringement would depend entirely on the context as related to the afore mentioned criteria.

Yes, I understand that the judge would write a report that appeared to consider the relevant criteria. The real question is, would that judge be willing to make the same logical contortions that other judges have made for Google?

I think that he would just go in favor of his biases, and right now we have a judiciary that is heavily biased against the little guy from the start, and this is only exacerbated by an inability to retain hotshot lawyers.

>My understanding is that the copyright part of the case was thrown out and thus was settled solely around CFAA matters.

The only portion of the copyright claim that was dismissed was Craigslist's claim that it owned an exclusive license in the scraped content. This was based on a short-lived ToU update that was specifically intended to strengthen Craigslist's case in this instance. The remaining copyright-related claims were allowed to stand, including a claim that Padmapper had violated a copyright Craigslist holds on the collection of advertisements (rather than on the advertisements themselves). [1]

>[re: RMG] I agree that the logic used for the judgement is absurd (for reasons that are plainly obvious to any HN user).

If you agree the logic was absurd, you agree that a copy of the page that exists in RAM for microseconds does not qualify as a protected copy any more than the reflection of an image on one's retina qualifies. As a "copy" that should be ineligible for copy protection, it doesn't matter if it qualifies for fair use (and I don't necessarily agree that it wouldn't).

> [re: Facebook v. Power] This is not a case involving a defense of fair use (as far as I can tell).

Correct. I was including it because it's an example of Google getting another free pass for stuff that shuts others down, which is the CFAA. CFAA claims are raised against Google in at least Field and Perfect 10, and they get dismissed based on the judge's assumption that the plaintiff knows about the special steps Google makes you take to stop them from violating the CFAA, the absurdity of which we've already discussed.

My wording that the "findings were very similar" was definitely bad since a different law was in play. I meant they were very similar in nature, not in fact. That said, it's likely the only reason that the cached pages weren't considered infringement is that Facebook didn't bring it up.

>But none of the evidence you've cited supports the argument that Google is infringing copyrights in its core activities nor that Google is the only entity where copyright laws and fair use legislation don't apply.

Again, I'm discussing this from a practical position, not one that is strictly compliant with legal theory, where judges always enforce the law with perfect equity, and in which anything a judge (or jury) finds becomes Official Truth de-facto.

From a textbook perspective, sure, everyone has all the same rights and the legal system is always applied equitably. I simply don't believe that has borne out in practice when it comes to internet-centric companies that aren't household names.

It seems that the things Google does are considered infringement when other people do them. Thus, it behooves us to know the actual law and follow it, even if Google gets a free pass, since we can't rely on the judiciary to interpret the law favorably for us.

RMG is a great example because it occurred after Perfect 10, and the same argument against RAM copies was raised in both cases. It's apparently fair use if Google scrapes your page to download and rehost all of your images, but it's not fair use to read out non-copyrightable factual data unobtainable from any other source (like ticket prices and event times) and rehost it nowhere. Sure.

The alternate lesson here is to focus on getting really big and powerful really quickly, and making sure you cultivate a good public image, so that judges are afraid to rule against you in ways that would affect a product offering upon which millions of people depend. That seems to have worked for most big internet companies, actually. Definitely worked for Facebook and Google.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos._v._....

[1] http://www.dmlp.org/sites/dmlp.org/files/2013-04-30-Order%20... pgs. 9-16


> they'll sic an army of $1,000/hr lawyers on you

They don't even need to do that. They just cheerfully agree to not scrape you, and wait for you to come back and beg to be re-instated when your search traffic plummets.


Isn't that what webarchive/wayback machine do? I think they use a "Fair Use" defense.


Oh for sure. But BitTorrent is still around and works great!


Well, it's just a technical response code. 200 OK - everything went as normal, here's your data. By the same token, the door on a shop doesn't stop you walking out without paying and the road markings don't stop you from driving in the wrong lane.

I think imbuing technical protocols with legal implications would be even worse than the current situation: changing anything in a protocol would require changing the law, and getting a protocol implementation slightly wrong would carry real-world legal repercussions on the order of releasing your work into the public domain rather than retaining copyright. Let the lawyers make the law, and check the human-readable terms of service before using the data. Trying to out-lawyer the lawyers is like challenging a hedgehog to a butt-kicking brawl.


The protocol is also that you send a valid, non-faked User-Agent:

"The User-Agent request-header field contains information about the user agent originating the request. This is for [...] the tracing of protocol violations [...]. User agents SHOULD include this field with requests"

Many scrapers disregard this part of the protocol. Of course, whether a headless browser should send a different UA is an interesting question.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
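
For what it's worth, sending an honest, descriptive User-Agent is a one-liner in most clients. A minimal sketch in Python with the requests library; the bot name and contact URL are placeholders:

    import requests

    # A descriptive, non-faked User-Agent per the RFC's SHOULD; the bot
    # name and contact URL below are placeholders.
    headers = {
        "User-Agent": "ExampleScraper/1.0 (+https://example.com/bot-info)"
    }
    resp = requests.get("https://example.com/page", headers=headers)
    print(resp.headers.get("Content-Type"))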


User-Agent is a SHOULD, not a MUST. There are also practically no browsers that send a non-fake User-Agent, since they almost all claim to be Mozilla/5.0.


One rarely visits corporate property in order to access corporate websites. The analogy may be flawed, but this objection to it is as well.


A better metaphor for this would be the "Sunday Flyers" that come in the newspaper (e.g. for big box stores like Best Buy). They sent that information to you; they cannot then attempt to restrict how you use that information (though they have tried to claim copyright over pricing against sites that aggregate the flyers).


Yes, it's important to understand that in the United States, web scraping is usually an illegal activity under the CFAA. If you draw enough attention, your scrape target will notice and threaten you, and probably follow through with the suit. Since the CFAA prescribes both civil and criminal penalties, you may even find yourself in jail for accessing data without the company's approval. Aaron Swartz was being prosecuted under these provisions for scraping public domain data.

The CFAA is a really bad law and creates the network effect lock-in that we all considered a natural part of the web. It doesn't have to be that way -- users should be free to use any browsing appliance they want, including so-called "scrapers".

Big companies like Google not only got their start by flagrantly violating the CFAA, copyright, and privacy laws, but they continue to do so. The moral of the story is hurry up and get big before you get sued or arrested.

There's a long history of ridiculous web scraping rulings based on technical misunderstandings by neophyte judges, including Ticketmaster v. RMG, where infringement was found because the company scraped data out of a page with the Ticketmaster logo on it.

Facebook sued a company called Power Ventures which read out only the user's own data. The founder was found personally liable for $3 million in damages. Facebook did this because they don't want it to be easy for their users to move between social media services. If it's easy, Facebook has to compete on merit instead of just keeping switching costs high. Facebook doesn't like that, so they sue people who make it possible -- and the law says they should win.

We badly need a revised law, but the powers-that-be will strongly oppose it because it would threaten their monopoly over web properties. They continue to flaunt their strategic ignorance of these laws and then take shelter behind them to stop risk from small innovators (i.e., having to compete fair and square).

In the real world, we have a lot of laws that mostly prevent this kind of bad behavior. In cyberspace, the structure is such that most of those laws are not applicable. We need to update and port the pro-small-business logic we have for meatspace companies so that it counts online too. The state of affairs online is really bad.

I want to get a law called the "Consumer Data Freedom Act" passed, which would allow users to access any web property with any non-disruptive browsing device, including custom scrapers that don't impose much more load than a typical user browsing session would.


Yes laws are neat and a reason for attending law school I suppose. I'm of the simpleton opinion that TCP/IP and the other protocols are the law of the net, and you ought to start with those.


One of my favorite scenes from 'Blow':

Judge: George Jung, you stand accused of possession of six hundred and sixty pounds of marijuana with intent to distribute. How do you plead?

George: Your honor, I'd like to say a few words to the court if I may.

Judge: Well, you're gonna have to stop slouching and stand up to address this court, sir.

George: [stands] Alright. Well, in all honesty, I don't feel that what I've done is a crime. And I think it's illogical and irresponsible for you to sentence me to prison. Because, when you think about it, what did I really do? I crossed an imaginary line with a bunch of plants. I mean, you say I'm an outlaw, you say I'm a thief, but where's the Christmas dinner for the people on relief? Huh? You say you're looking for someone who's never weak but always strong, to gather flowers constantly whether you are right or wrong, someone to open each and every door, but it ain't me, babe, huh? No, no, no, it ain't me, babe. It ain't me you're looking for, babe. You follow?

Judge: Yeah... Gosh, you know, your concepts are really interesting, Mister Jung.

George: Thank you.

Judge: Unfortunately for you, the line you crossed was real and the plants you brought with you were illegal, so your bail is twenty thousand dollars.


Excellent news. I'm of the opinion that might makes right is the law of the land, and I'm going to start by buying a bigger gun.


If you were really of that opinion I think you would hide instead.


Or you just move to a locale where scraping is legal, and any contractual terms saying otherwise are null and void.

I’d assume a lot of HN users are from such locales.

We don’t always have to assume US laws apply globally – they don’t.


I was actually searching for such a jurisdiction as my startup was shut down by a company that invoked the CFAA late last year. What do you suggest? The EU is even worse than the US when it comes to data freedom and tech access. The law on the books in many former British colonies provides marginally more protection (the "Telecommunications Act"), but it'd probably still be disputable, and you'd be shut down anyway unless you had millions sitting around with your lawyers' name on it.


It depends on what you are doing. The CFAA is very far-reaching, but of course many aspects where the US answer is "CFAA" are covered by other laws. [EDIT: removed outdated information superseded by European decisions, which make the situation a lot less clear]

anti-scraping: If somebody were to offer a telephone book database online and you created a copy of that to sell on your own, you'd almost certainly lose in the EU (since unlike in the US, databases as pure collections of facts have their own legal protection there – the sui generis database right)

The legally safest locations probably are outside the western world if you are targeting western sites.


>Pro-scraping: Last big case I remember here was a flight-search site that did flight search and booking(!) via a scraper and Ryanair lost when they tried to sue them for that, since they couldn't argue convincingly how that was damaging them.

Every case I've seen wrt Ryanair (they sue a lot of people) has resulted in a win for Ryanair. Do you have details on the case you're describing?

>anti-scraping: [...]

Scraping purely factual data is one of my points of defense in the US. I don't want to give it away.

>It's still risky though, the safest locations probably are outside the western world if you are targeting western sites.

Yeah, this was ultimately the conclusion I had to come to. However, outside the West, the Western companies will just send someone with a briefcase full of $100 bills and pay them off. Corrupt government officials in these locations want the goodwill of a big American company a lot more than they care about any particular random guy.

There is only one workable solution: run the service totally anonymously and maintain good opsec so that your cover isn't blown. All under the table. This has its own issues, like making it difficult to receive payment and putting one at much greater legal risk than a mere CFAA dispute, but it's the only option if you don't plan to get shut down.


I was referring to a BGH decision (30.04.2014, Az. I ZR 224/12), but it seems like newer decisions from European courts kill that argument :/

I edited my original comment to reflect that.


Which site were you running?



Dubai?


Scraping being illegal is as dumb as saying it's illegal to take photos in public. You aren't affecting anyone if you do it respectfully.


Being dumb doesn't prevent laws from existing (we all know those "funny US laws" along the lines of "no kissing toy camels on the cheek, but the mouth is okay").

Also, since this is somewhat untouched territory, don't be so sure that you'll get a judge who is as well-versed in web scraping and infrastructure as you, or who shares your opinions on the subject. (And given that precedents are so important in US law, you'd better hope someone else before you didn't get such a judge.)


Obviously it's a good idea to follow TOS. But as a practical matter, they have to know that you're doing it before they can take action. You wouldn't want to put up a site announcing that you're selling scraped LinkedIn data, for example. But if that data is valuable to your business - collecting names of people that work in certain positions at certain companies so that you can do targeted snail mail campaigns for example - you could quietly scrape and use the data without issue. Use proxies and prosper.
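
For the curious, "use proxies" usually just means routing requests through other IP addresses. A minimal sketch in Python with the requests library; the proxy address is a placeholder from a documentation IP range, and whether doing this is wise or lawful is exactly what the rest of the thread is arguing about:

    import requests

    # Route a request through a proxy; the address below is a placeholder
    # (a documentation IP), not a real proxy.
    proxies = {
        "http": "http://203.0.113.10:8080",
        "https": "http://203.0.113.10:8080",
    }
    resp = requests.get("https://example.com/listing", proxies=proxies, timeout=30)
    print(resp.status_code)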


This only goes so far, and if you get found out, you're looking at willful infringement (which usually triples damages) and probably criminal charges under the CFAA. However, it should be acknowledged that there are many people making quiet livings off scrapes that are not legal. There are even a few companies making loud livings off such scrapes, like Google.

If you're not going to run it totally anonymously, you should be prepared to jettison and repackage it when you get found out (so that you appear to be complying with the C&D).

Scraping is a huge part of the web, and everyone does it. It sucks that it has to live underground because only big companies can duke it out in court.


TOS violations have been found to not be subject to the criminal provisions of the CFAA. The only circumstance under which it would become a criminal issue is if they successfully sued you for it, and obtained a judgment that included a provision ordering you to cease scraping. Ignoring such a court order would then potentially expose you to a criminal contempt of court action.


I think the lessons learned from those lawsuits was to always have some sort of 3rd-party intermediary scraping consultancy firm you engage that is totally not just your business under another name.


> (Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

I played with the idea of creating some social aggregation type service with some friends (as a business). After reading about FB's past behavior with regard to this, and how essential they are to any sort of service like that, I canned the project. Regardless of what their TOS say, if you get on their radar and they send you a cease-and-desist, it's game over. Facebook is not in the business of subverting their revenue stream, so if you are making money off them and it's preventing them from capitalizing on their users, don't expect to last long if you exist by their grace.

Really, there's an interesting space between so small nobody cares and large enough that getting shut down is a real problem. A lot of projects start small and end up (relatively) large, but without a good way to pay for the service itself. While not every service needs to be a business and make money, once you reach the level where you risk either being shut out of your data source or you need to somehow work out an understanding with that source, how do you approach that when being able to pay is off the table? Not to mention the problem of approaching them before you have to and forcing the situation, versus waiting too long and risking the wrath of the source because you've abused their service for as long as you have. Has anyone else been in this situation and found an approach that works?


One thing I'm not quite clear on here.

I understand the use of ToS clauses to prevent scraping but I do kind of wonder to what extent they have authority here.

IANAL, but surely this would fall under copyright law? While re-publishing copyright-protected data without consent is probably unlawful in your region (like scraping an art site and re-posting the images), I wouldn't think just scraping data points for a different purpose (like scraping amazon for the purposes of price comparison) is nearly so clear cut (or enforceable), but maybe I'm just naive.


The content falls under copyright law. The problem is that you have to enter the company's servers to obtain this data, and the CFAA says that the company can treat their public-facing web servers like private property, and if you're caught "trespassing", you can be sued and jailed. Scraping plaintiffs are usually granted an injunction based on "trespass to chattels" (among other rationales), i.e., trespass to an individual's property (as opposed to land).

Companies like PriceZombie are forced to stop because the CFAA says that Amazon can prevent them from accessing their servers by decree alone. A ToS isn't even really necessary for this, but it helps them pin down their argument.

PriceZombie could try to get the data from third-party caches, but it only solves part of the problem, because copyright and trademarks come back into the picture once you have a replica of the target page. In Ticketmaster v. RMG Technologies, the judge found RMG infringing on Ticketmaster's trademarks and copyrights because the page they were scraping included Ticketmaster's logo. The judge said the copy of the full page that existed momentarily in RAM while the scraper extracted the non-copyrightable data constituted a copy that infringed on Ticketmaster's rights, even though the logo was never used by the application in any way, it just happened to be on the page.


I was going to post something similar. When you go to all that trouble to do something the web site owner is pretty clearly trying to prevent, that is convincing evidence that you are breaking the terms of service. And breaking the terms of service for a web site has been held to be a civil violation (a number of times involving eBay and Amazon) and has been treated by the Justice Department as potentially a CFAA violation.


Actually it's been held that TOS violations are NOT subject to the criminal provisions of the CFAA.


Are you referring to the MySpace case? Or the July 2016 decision by the Ninth Circuit (https://cdn.ca9.uscourts.gov/datastore/opinions/2016/07/05/1...)? In US v. Nosal it seems like they come down in favor of a CFAA violation if the user acts in an unauthorized way. The author of the piece talks about bypassing CAPTCHAs, which are, in one interpretation, a demand for authorization (by proving that you are a human and not a program); by circumventing that authorization, they have stepped quite clearly into CFAA territory.

If you were referring to a different decision I'd love to read it. I follow this stuff (and at one time explored what legal action our startup could take against scrapers). In our case we also offered a paid API so it was fairly easy to establish damages.


The case you're referring to is an entirely different set of circumstances. From the text:

"The panel held that the defendant, a former employee whose computer access credentials were revoked, acted “without authorization” in violation of the CFAA when he or his former employee co-conspirators used the login credentials of a current employee to gain access to computer data owned by the former employer and to circumvent the revocation of access. "

I think that case is unambiguous - this guy was using someone else's credentials to access secured systems after having been explicitly told that he could not. I was referring to the MySpace case.

I don't think these two cases are in conflict; IMO they are very different. Additionally, for our purposes in this comment thread, we're talking about scraping of publicly available websites by outside parties, not by former employees whose access has been explicitly revoked. That is different than either of these cases.


I am no expert, but I always thought you could scrape without consequence provided you never distribute your scrapings?


There are hundreds of paid services that scrape Google heavily (search engine ranking trackers). How are they legal?


They're probably doing it from a country where it's legal. In most countries there is no law that would be applicable in this case.


Which countries is it legal in?


They aren't, or at least, they won't be if Google decides it doesn't like them anymore and decides to bring the matter to court.

The CFAA says it's a crime to exceed "authorized access". Authorized access is whatever the server's owner says it is. If they change their mind, you must cease and desist or risk both civil and criminal penalties. A contract defining the length and nature of your authorization from the server's owner would go a long way to establishing your rights to access, but no one is going to give that to a small player.


You forgot to add: In USA.


Unfortunately this is true in almost all of the developed world. While developing countries may not have specific legal prohibitions, we know that that doesn't stop big companies from having their way. Heck, even a first-world country like Sweden couldn't resist the pressure from Hollywood to prosecute and jail the operators of the Pirate Bay, which had long been recognized as totally legal in Sweden.

Another issue is that on the internet, jurisdiction is a very messy affair. An American judge will likely determine that California and/or the federal government has jurisdiction over such a case because Google is based in California. Most developed countries have treaties with one another that allow them to enforce foreign civil judgments on behalf of the jurisdiction that entered them. Most developed countries also have mutual extradition treaties. The countries that don't can easily be paid off by an interested party.


Isn't Google search based off of Google "scraping" the web?


Bingo


The code is pretty cool. Thanks for releasing that! May I ask why you built your own scraper infrastructure instead of building it on top of a known framework like scrapy (which is in Python as well)?
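
For readers who haven't used it, a scrapy spider is only a few lines. A minimal sketch; the spider name, start URL, and CSS selectors below are placeholders invented for illustration, not anything from the Facebook scraper:

    import scrapy

    class PostsSpider(scrapy.Spider):
        # Placeholder spider: name, start URL, and selectors are invented
        # for illustration only.
        name = "posts"
        start_urls = ["https://example.com/blog"]

        def parse(self, response):
            # Extract a title and link from each article element on the page.
            for post in response.css("article"):
                yield {
                    "title": post.css("h2::text").extract_first(),
                    "url": post.css("a::attr(href)").extract_first(),
                }

(Runnable with something like "scrapy runspider spider.py -o posts.json".)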


I actually wonder if that kind of ToS restriction has any legal validity in the EU, since here that kind of thing typically cannot be enforced legally.


Corporations will abuse your personal integrity whenever they get a chance, while abiding the law. Corporations will cry like babies when their publicly available data (their livelihood) gets scraped. They will take you to court.

They consider their data to be theirs, even though they published it on the internet. They consider your data (your personal integrity) to be theirs as well, because how can you assume personal integrity when you are surfing the internet?

I have high hopes that the judicial system, some time not too far from now, will realize that since the law should be a reflection of current moral standards it will always be behind, trying to catch up with us – and that those who break the law without breaking current moral standards are still "good citizens" undeserving of prison or fines.

I guess Google won this iteration of the internet because of the double standard site owners apply: allowing Google to scrape anything while hindering any competitors from doing the same. There will only be a true competitor to Google when, in the next iteration of the internet, we realize that searching vast amounts of data (the internet) is a solved problem, that anyone can do as good a job as Google, and move on to the next quirk, around which there will be competition; in the end that quirk will be solved, we'll have a winner, and that will signal that it is time to move on to the next iteration.


> Corporations will abuse your personal integrity whenever they get a chance, while abiding the law.

Call me cynical if you will, but I'd leave "while abiding the law" out of that, or at least replace it with "while hoping they aren't breaking the law". Due diligence on these matters is often sadly lacking. They'll take the information first and only consider any such implications when/if they come up later.

Large organisations like Google probably will make the up-front effort to remain legal, because they are in the public eye enough for lack of doing so to attract a lot of unwanted press, but you don't have to get a lot smaller than that to start finding companies who are a lot less careful (or in some cases wilfully negligent).


I would use Microsoft as a precedent. Sure, they will attempt to stay legal, but only by pushing it as far as they can.

For instance, the browser-choice screen that the EU required Windows to ship never worked. It was a "bug". Somehow they must have omitted to test the feature...

Up until last year Microsoft seemed to be playing nice, and I think Google and Facebook had become the new corporate villains. But recently the Windows team seems minded to challenge them for that position.


Often it's indeed cheaper to pay a government-mandated fine than to forgo the market opportunities afforded by behavior that later turns out to run afoul of some law or regulation.


The difference is that Google didn't agree to not scrape your data. You, as per their TOS, agreed not to scrape theirs, as part of the condition of using their service.


Which TOS?

I might have accepted terms when I created a Google Account but in no way do I agree to a TOS by visiting a URL.


To see the terms that Google thinks you have agreed to, click 'Terms' at the bottom of www.google.com

If that doesn't hold up in court, in future on your first visit to Google it will simply display some text and require that you click 'I agree' to continue.

Either way, it seems reasonable to me that you should agree to their terms in order to use their service.


So if instead I scrape their site (like they are scraping others) I don't have any opportunity to agree to their terms? Much like their scrapers on other sites?

I'm honestly wondering about the double standard. There is a rational way to discuss morality/ethics and subsequent laws regarding most technical aspects, that often mirrors real world (read: offline/analog) scenarios. It's unfortunate that the legal system has instead been appropriated by lawyers.


>> It's unfortunate that the legal system has instead been appropriated by lawyers.

omg, really?

It's unfortunate that the internet has instead been appropriated by hackers. It's unfortunate that the stock market has instead been appropriated by traders. It's unfortunate that the asylum has instead been appropriated by inmates.


To some extent, yes. When people spend enough time in their given field to know the ins and outs, those less scrupulous tend to bend the rules more and more. While not _strictly_ against the rules, it often ends up going against the spirit underlying the industry.

Very few traders went to jail after 2008. Seemingly legal (or at least not illegal). Should they have? Most bright/talented lawyers are likely working (again, within the law) to get megacorps or rich people off for things poorer people would not get away with. In our field, the topic of this OP is one of those issues. What information is free and what information is not? What things I'm allowed to do offline am I allowed to do online?

I'm not proposing a solution, but any system populated by humans will be abused by some, and fought for by some idealists, all within that system's rules.

Let's take murder: I stab someone: murder. I use a broom to push a flower pot off a balcony hitting someone in the head, killing them: murder. I swat a butterfly in Beijing, causing a chain of events to a container crushing a dock worker in Rotterdam. Murder? If this extreme example comes down to intent it's thought crime, otherwise I'm playing within the rules of the system, and I just killed someone, scot-free.

While there apparently were no laws prohibiting the upselling of bad mortgages, and banks had the resources to move the market towards more and worse mortgages, that too was within the system's rules; but I personally think it went far beyond the intended use of that market, and well outside the spirit of the laws.

There's a huge difference between judicial justice and what most would agree was "justice". That's where my first comment came in. True about most systems.


Don't leave us hanging--did the butterfly make it??


Yeah, really. Law is not an end in itself; it's meant to serve a purpose. When the people whose job is instrumental to that purpose start deciding what the purpose is, bad things happen. The same goes for MBAs and businesses.


There's no double standard. In the case of crawling and scraping their site, the terms are available in the robots.txt file. And Google abides by the robots.txt terms of other websites.

I'm not sure why you dislike this 'appropriated by lawyers' outcome: For web crawling look at robots.txt, for other uses look at the Terms link on the homepage. If you don't agree to the terms then stop accessing the website. Seems straightforward and fair to me.


Yeah, you're right in response to my comment. It was a bad example. But while google.com (for example) has a robots.txt, you could argue that it's not exactly fair, nor does it invite disruption. For example, it whitelists Twitter and Facebook for images (and blacklists everything else). While I won't cry foul too much, I get the feeling that Google entered the stage when the internet was quite a bit more wild west (for good and bad) and then the internet changed, partly because of them and partly because of other actors. For at least some markets I believe it's almost impossible to gain a foothold now as a new actor, as they're only open to (what are basically) cartels. Email is another one, as you can be locked out of communication with gmail.com or outlook.com with basically no recourse if you run your own email server.


The TOS that Google follows is published in the robots.txt file. If you don't want Google to scrape your site, then that's all you need. There's no double standard.


I'm sure that's true for your average Wordpress publisher, but the big guys will either slap you with a law suit or take other measures to make you stop crawling their site.

Scraping and crawling are the same thing, btw. I absolutely love how the English language has several words for the same thing. Your language is very expressive.

Google is a scraper. Your data will end up in their index. You are perfectly OK with Google "stealing" your data.

A new player crawling your site is an offence to you. How dare someone other than Google or Bing put pressure on my site? How dare they steal my data?

TOS is a joke.

I wonder, what was the intention of the founding fathers of the internet? Was it not to make data publicly available?


> If that doesn't hold up in court, in future on your first visit to Google it will simply display some text and require that you click 'I agree' to continue.

This statement is demonstrably false, as shown by all the places in the world where this type of TOS-nonsense actually does not hold up in court.

And in the USA, it's (as usual) even slightly more absurd: The only reason it does hold up in court is because Google can afford justice.


Try using google from a fresh install, they'll force you to accept their TOS.


Are they A/B testing this or is acceptance IP-based? I reinstalled recently and I didn't see it. Firefox in private navigation mode also lets me use it without forcing me to agree with anything.


Lucky you. I get their stupid modal overlay more often than I'm happy with. On top of that it now usually defaults to Dutch and Dutch results even when I don't want this. Highly annoying.


I'm under the impression that simply having a visible legal notice like "By visiting this page you agree to our ToS" is enough to bypass that.


Not in the EU, you have to explicitly and manually agree to them.


Regarding scraping and the legality of it all: I wonder if it's still illegal if you respect robots.txt and the other meta elements in the HTML standards.

If Google's actions were illegal, I'm sure they would have been sued, even though their scraping and indexing is usually helpful for the website owner.


I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate-limiting and most other controls.

I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.

At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top n user agents, I can randomly generate about ~800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-CAPTCHA Cloudflare. Backing storage is Postgres.
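Roughly, the UA generation is just template filling. A sketch of the idea (not the actual code; the template list and version ranges here are made up):

    import random

    # A couple of genuinely popular user agents as templates; in practice you'd
    # seed this from a "top N user agents" dataset.
    BASE_UAS = [
        "Mozilla/5.0 (Windows NT {nt}.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/{major}.0.{build}.{patch} Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_{osx}) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/{major}.0.{build}.{patch} Safari/537.36",
    ]

    def random_ua():
        """Pick a template and fill in plausible-looking version numbers."""
        template = random.choice(BASE_UAS)
        return template.format(
            nt=random.choice([6, 10]),
            osx=random.randint(9, 11),
            major=random.randint(45, 52),      # Chrome versions current in 2016
            build=random.randint(2000, 2800),
            patch=random.randint(0, 120),
        )

    print(random_ua())

A handful of templates with a few independently varied version fields multiplies out to a pool in the hundreds of thousands, without any of the strings looking obviously synthetic.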

Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that re-writes the page content in hopes of making the resulting layout actually pleasant to read on, with pluggable rulesets that determine page element decomposition.

At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is I actually pay for the hosts.

---

Scaling something like this up to high volume is really an interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could reliably handle large (>100K) messages without eventually wedging.
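The fragmentation itself isn't complicated; the annoying parts are reassembly and the failure modes. A sketch of the publishing side, assuming pika and made-up header names (not my actual code):

    import math
    import uuid
    import pika

    CHUNK_SIZE = 64 * 1024  # keep individual AMQP messages comfortably small

    def publish_fragmented(channel, queue, payload):
        """Split a large payload into chunks that carry enough metadata
        (message id, sequence number, total count) to be reassembled later."""
        msg_id = str(uuid.uuid4())
        total = max(1, math.ceil(len(payload) / CHUNK_SIZE))
        for seq in range(total):
            chunk = payload[seq * CHUNK_SIZE:(seq + 1) * CHUNK_SIZE]
            channel.basic_publish(
                exchange="",
                routing_key=queue,
                body=chunk,
                properties=pika.BasicProperties(
                    headers={"msg_id": msg_id, "seq": seq, "total": total},
                    delivery_mode=2,  # persistent
                ),
            )

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="scrape_jobs", durable=True)
    publish_fragmented(channel, "scrape_jobs", b"x" * 500000)
    connection.close()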

There are other fun problems too, like the fact that I have a postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (My home servers are in my bedroom closet).

---

It's all on github, FWIW:

Manager: https://github.com/fake-name/ReadableWebProxy

Agent and salt scheduler: https://github.com/fake-name/AutoTriever


Yet another incredible technical achievement due to someone's quest for more porn (https://github.com/fake-name/AutoTriever/blob/master/setting...).


That's a separate project:

- https://github.com/fake-name/ExHentai-Archival

- https://github.com/fake-name/PatreonArchiver

- https://github.com/fake-name/xA-Scraper

- https://github.com/fake-name/DanbooruScraper

Or... well, 4 separate projects. Whoops?

At one point, a friend and I were looking at trying to basically replicate the google deep-dream neural net thing, only with a training set of porn. It turns out getting a well tagged dataset for training is somewhat challenging.

Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.

Next up, automate the consumption too!


At least Ex supports torrents, and also has some custom P2P software you can run (it serves content) from which data can be siphoned off.

And what is served through their website is resized. So web-scraping is an inferior approach.


You seem to be assuming

1. I'm scraping the resized galleries.

2. I don't have the Hath perk that makes the galleries full sized.

3. I don't have a phash-based fuzzy image deduplication system on top of all this (see https://github.com/fake-name/IntraArchiveDeduplicator). Its main purpose is to deduplicate manga (https://github.com/fake-name/MangaCMS).
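For the curious, the core of a phash-based fuzzy dedup check is tiny. A sketch assuming the Pillow and imagehash libraries (not the actual IntraArchiveDeduplicator code):

    from PIL import Image
    import imagehash

    def near_duplicates(path_a, path_b, max_distance=4):
        """Perceptual hashes survive resizing/re-encoding; a small Hamming
        distance between them means the images are almost certainly the same."""
        hash_a = imagehash.phash(Image.open(path_a))
        hash_b = imagehash.phash(Image.open(path_b))
        return (hash_a - hash_b) <= max_distance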


Jesus, your projects are massive. Does your job involve working on these or are these just side things?


It's all entirely hobby things.


Oh my god. Can you share any results?


The project never went anywhere, unfortunately, and I haven't had time to look at it recently.

I have huge, uh, "datasets" around still, though.


You're doing god's work.


How do you circumvent cloud provider IP blocks? For example, one site blocks all requests from AWS EC2 servers.


None of the sites I'm scraping do that, mostly.

I'm not scraping high value sites like that (I mostly target amateur original content). It's not really of interest to other businesses. As such, I tend to just run into things like normal cloud-flare wrapped sites, and one place that tried to detect bots and return intentionally garbled data.

If I run into that sort of thing, I guess we'll see.


I've never used this, and it's incredibly shady considering the users probably do not realize that their Hola browser plugin does this, but Hola runs a paid VPN service where you can get thousands of low-bandwidth connections on unique residential IP addresses, provided generously through their "free" VPN users.... It's essentially a legitimate attempt at running a botnet as a service.

But if the end justifies the means... http://luminati.io/


I'm sure there'd be a ton of people that would love to pay to use your platform (who cares if the source is available; I don't want to run my own, because once the code is written, it's the ops that's hard). But then I suppose it would be hard to stay unnoticed.


Yeah, running this thing publicly would be a huge mess from a copyright perspective, since it literally re-hosts everything as a core part of how it works.

As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.

Really, if nothing else, the project has been enormously educational for me. I've learnt a boatload about distributed systems, learned a bit of SQL, dicked about with databases a bunch, and actually experienced deploying a complex multi-component application across multiple disparate data centers.


This project is really cool. Last year I was looking into open source projects that implement something like Readability so that I could scrape articles from my RSS feeds and turn them into plaintext. But I didn't find anything that blew me away. The best I got was stealing the implementation from Firefox, and I lost interest before I could make it worthwhile. (Now revisiting the idea, I wonder why I never thought of passing a user-agent from a mobile browser... Probably would have helped a lot.)

I see you don't have a license listed on GitHub. Do you have a license in mind for these?


It's probably GPL; I'll have to figure out my dependencies and see what it's infected with. I tend to err towards BSD for my own cruft.

This isn't quite as fancy as readability, though I integrated a port of readability for a while. Now I just write a ruleset for a site that has stuff that interests me.


Note: I stuck it under BSD license.


Similar, paid solution: https://scrapinghub.com/crawlera/


What would the rough costs be to run the 800k UA scenario?


To be clear, I have a pool of 800K theoretical UAs derived from the mechanism I use to generate them, not 800K clients.

Regarding costs, I really have no idea. It depends on how rapidly you cycle the UA, and how fast whatever you're scraping is.


take my money!


How can I get ahold of you directly?


connorw at imaginaryindustries dot com


Thanks. Email sent.


Wow, that's impressive!


A neat trick I sometimes use to "scrape" data from sites that use jQuery AJAX to load data is to plug a middleware into jQuery's XHR handling:

      $.ajaxSetup({
        // dataFilter runs on every raw response before jQuery parses it;
        // `this` is the merged ajax settings object, so `this.url` is the request URL
        dataFilter: function (data, type) {
          if (this.url === 'some url that you want to watch!') {
            // Do anything with the data here; `data` is the raw response body
            awesomeMethod(data)
          }
          return data
        }
      })

I remember last using it with an infinite-scroll page with a periodic callback that scrolled the page down every 2 seconds, and the `awesomeMethod` just initiated the download. Pasted it all in dev-tools console, and the cheap "scraper" was ready!


Another trick: you can hover over elements in Chrome with F12 and the inspect tool, then right click > Copy > Copy selector, and Chrome will generate one for you. That way you don't have to actually do any work.

With a selector it's easy to grab data, here's a linux command that gets every user that posted in this thread:

  lynx -base -source 'https://news.ycombinator.com/item?id=12345693' | hxnormalize -x | \
    hxselect -c -s '\n' "td > table > tbody > tr > td.default > div:nth-child(1) > span > a.hnuser"
Here are the most frequent commenters:

     27 cookiecaper
     22 franciskim
      6 fake-name
      4 niftich
      4 flukus
      4 elmigranto
      4 downandout
      3 tedunangst
      3 siegecraft
      3 muglug
      3 minimaxir
      3 madamelic


You can also build a chrome extension if you need to navigate to multiple pages and use a long-running scraping process. I've done this several times and it's really easy to get one up and running if you use an extension boilerplate (30 minutes tops).


Do you have something? I was going to write the very same extension (but distributed, so I could add it to my PC and my friends') but never did.


This is the boilerplate I used last time: http://extensionizr.com


Didn't know about `extensionizr`. Looks super cool. Thanks!


Yup I've been known to do this as well :) I'd have a Node.js + Mongo endpoint ready on the other side.


Why not use Nightmare with Node.js + Mongo?

Here is an example of injecting a jQuery script into a page with jQuery loaded and getting nicely formatted information returned. [1]

[1]https://github.com/adam-s/playboy-fm/blob/master/server/scra...


This good list of tactics underscores, for me, how the state of the Web has made it a lot more difficult to teach web scraping as a fun exercise for newbie programmers. It used to be that you could get by with the assumption that what you see in the browser is what you get when you download the raw HTML... but that's increasingly less often the case. So now you have to teach how to debug via the console and network panel, on top of basic HTTP concepts (such as query parameters).

(Even more problematic is that college kids today seem to have a decaying understanding of what a URL is, given how much web navigation we do through the omnibar or apps, particularly on mobile, but that's another issue).

I've been archiving a few government sites to preserve them for web scraping exercises [0] (the Texas death penalty site is a classic, both for being relatively simple at first, and for being incredibly convoluted depending on what level of detail you want to scrape [1]). But I imagine even government sites will move more toward AJAX/app-like sites, if the trend at the federal level means anything.

That said, I think the analytics.usa.gov site is a great place to demonstrate the difference between server-generated HTML and client-rendered HTML.

But as someone who just likes doing web scraping, I feel the tools have mostly kept up with the changes to the web. It's been relatively easy, for example, to run Selenium through Python to mimic user action [2]. Same with PhantomJS through Node, which has vastly improved how accurately it renders pages for screenshots compared to what I remember from a few years back.

[0] https://github.com/wgetsnaps

[1] https://github.com/wgetsnaps/tdcj-state-tx-us--death_row

[2] https://gist.github.com/dannguyen/8a6fa49253c1d6a0eb92


It's unfortunate that nearly every webpage these days is a Javascript State Machine which you have to execute in a sandbox and inspect its internal state to get stuff out of.

On a blog post by Paul Kinlan ('Open Web Advocate' at Google and Chromium) [1], I lamented that we ended up here instead of the semantic web because the semantic web was hard to execute. Instead, every web page is a black-box, only navigable by an intelligent and/or sufficiently persuadable human.

But this is also why I don't buy ethical arguments against scraping. Sure, legally any company can unilaterally set any TOS prohibition against behavior they don't want, and these terms may be tested in court. But navigating a page in an automated manner that's designed to resemble interactions of humans (ie. through Selenium) is in my opinion ethical, because it merely time-shifts a user's activity.

[1] https://news.ycombinator.com/item?id=12206846


Tbh I didn't enjoy the article, it just seems like someone who has just learned about Node.js tried to explain (and mostly failed) how to use some packages to scrape a page. I was expecting to learn some new techniques, but all it explained was how to make a few API calls in order to solve a very specific problem. Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.


>Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.

I'm not saying you're one of these people, but it's frustrating when companies do this to potential employees and the candidate is told by friends and other management types, "well, that's the company, you just have to deal with it".

When someone flips it on the company then it's immature.

I applied somewhere recently and they invited me out to a pre-interview lunch. That went well so they called me in for an interview. That went well and the VP told me he would call me back to set up a second (third?) interview.

I never heard back from him. An ex-coworker there went to the VP to find out what was going on and the VP said he decided he wanted someone with more experience in the specific area they're working in.

But last he told me was he liked me and would schedule another interview, then when he changed his mind he never let me know.

I think people on both sides should be courteous and respectful through the process, but if employers are treating interviewees poorly then they shouldn't be surprised when they start getting treated poorly.


>When someone flips it on the company then it's immature.

First, it's hard to know when companies are doing this intentionally versus when things just get lost in the shuffle. (Never attribute to malice what can be explained by incompetence, and all that.) Meanwhile, the author was clearly ignoring the interviewer intentionally.

Second, the fact that Company A treated you rudely doesn't give you license to treat unrelated Company B rudely. For that matter, I'm not sure that the fact that Employee 1 at Company A treated you rudely gives you moral license to treat Employee 2 at Company A rudely. Show a little compassion for someone trapped in a dead-end job trying to put food on their family's table, for crying out loud.


As someone who has written a scraping framework, this article is useful AF.

> but all it explained was how to make a few API calls in order to solve a very specific problem.

Yeah, the very specific problems everyone runs into time after time. He presents specific solutions, and reasonable context. If I was googling for one of these problems, I'd be very happy to run into this page.

> Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails "

Your arrogance is my matter-of-fact.


You can't count as 'matter-of-fact' if you're not even bothering to communicate.


Thanks for your feedback, I do appreciate it.


Not many people can take criticism in stride like that. You are awesome.


:)


As a counterpoint: I think this article is fantastic. API restrictions are incredibly annoying when they pertain to what I consider to be my data. Should data and interface be so tightly joined? Of course not.


What is your data?

EDIT - Completely serious, you mean data you put on other peoples servers, using their services and expect them to let you have it back when and however you want it? Let's be serious here. You're lucky they let you access it at all.


I had a different take -- if one hasn't been keeping up with the "arms race" of modern web scraping (and the countermeasures some sites are adopting these days), or with the JS scraping ecosystem generally, then it seems like this article could make for a decent introduction.


Part of the turnoff for me was the middle-schooler tone and vocabulary. Good walkthrough with good code examples though, obviously written by a very smart JS dev.


Ok, I'll try to explain to this thread. I actually thought about removing the Facebook part, but I kept it in there because that is kind of how I felt and it is real. The middle-schooler tone and vocab is probably because I don't read a lot of books, and English is my 2nd language.

In reply to XCSme - no, I am not new to Node, and the point of my post is to illustrate some techniques that I haven't seen published anywhere on HN or in the community. My focus is quite different from what you think it is, so maybe it is my bad for bad writing skills; I'm still new to writing and learning.


Since when is a 'middle-schooler vocabulary' a bad thing? I distinctly remember learning on hn (when the Hemingway app became popular) that simple is better for readability.

https://contently.com/strategist/2015/01/28/this-surprising-...


Ok guys, I've elaborated a bit - tried to make the aim of the post a bit clearer, and I've removed the stuff about Facebook because I don't want to discomfort other readers. It's 5:25AM, it's been a crazy morning and I've got work tomorrow! XD


> "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.

I believe you should treat others how you want to be treated. FYI, recruiters do not usually follow up with rejected candidates and many are unresponsive. It's their way of telling candidates they are no longer interested.


When I interviewed for Google the recruiter was really nice and called to tell me that I didn't make it. I also gave her feedback on what I thought went wrong in the interview and what was wrong with their interview process. I don't know how the ones at Facebook are, but my recruiter did her job well and I appreciated that, even though some interviewers messed up.


Not wanting to thread hijack, but I'm just going to post an article I wrote a few years back, as it covers a few other things that are still relevant and often still gets referenced. Maybe it'll help some people out in combination with OP's post.

http://jakeaustwick.me/python-web-scraping-resource/


I was surprised to not see Scrapy listed, but then I saw there were some comments about it - but seriously, doing by hand what Scrapy has spent years perfecting is highly suboptimal.

I guess the distinction is between whether one wants to just "toy around" or run the spider for-real.


Nice post Jake!


I wrote a fairly complex spidering and scraping script in Node a few months ago. I found downcache[1] to be absolutely invaluable, particularly as I was debugging my parsing scripts, as I was able to rerun them relatively quickly over the cached responses.

However, when the network was no longer a bottleneck, I found that the speed and single-threaded nature of Node became one. It wasn't really that slow, relatively speaking, but I had a few hundred gigs of HTML to chew through every time I made a correction, so it was important to keep the turnaround as fast as possible.

I eventually managed to manually partition the task so I could launch separate Node scripts to handle different parts of it, but it wasn't a perfect split, and there was a fair bit of duplicated work, where a shared cache would have helped a great deal.

In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.

[1] https://www.npmjs.com/package/downcache


Threading in node is very easy, just use clusters. Alternatively, take any of the CPU intensive activity, like parsing the HTML and formatting as JSON, and just put that on an AWS lambda.

You can invoke as many lambdas from your application as you want in parallel and you're not going to be bottlenecked by your CPU :)


Clustering in Node creates isolated child processes, not threads. I needed to have shared queues, in-memory caches, and hashes to coordinate workers and avoid them doing duplicate work.

I did consider using clustering and having some master process coordinate everything, and using some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java, where you get thread pools and synchronized thread-safe collections out of the box.

And Lambda would have been totally impractical. As I said, I had hundreds of gigs of data to process. If I'd been uploading all of that over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.

I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use-case, I didn't find it was a very good fit.


Threading for a crawler is just a dirty way of not handling distribution. When you need more than one server, your threads won't save you. It has nothing to do with Node.js and its thread support.


I wasn't creating a new search engine, I was doing a one-off scraping job in my spare time. Creating a fully distributed solution would have been total overkill. But threading could and would have helped.

Honestly, stupidly hostile and ignorant comments like this are the absolute worst thing about Hacker News.


Wonder how difficult it would have been to pull the JS portion into Java by way of Rhino.


Scraping with Selenium in Docker is pretty great, especially because you can use the Docker API itself to spin up/shut down containers at will. So you can spin up a container to hit a specific URL in a second, scrape whatever you're looking for, then kill the container. This can be done via a job queue (sidekiq if you're using Ruby) to do all sorts of fun stuff.
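A minimal sketch of that spin-up/scrape/kill cycle, assuming the Python docker SDK, the Python selenium bindings and the selenium/standalone-chrome image (ports, waits and URLs are illustrative):

    import time

    import docker
    from selenium import webdriver

    client = docker.from_env()

    # Start a throwaway browser container exposing the WebDriver port.
    container = client.containers.run(
        "selenium/standalone-chrome",
        detach=True,
        ports={"4444/tcp": 4444},
        shm_size="2g",  # Chrome is unhappy with the default /dev/shm size
    )
    time.sleep(5)  # crude wait for the grid to come up; poll its status endpoint in real code

    try:
        driver = webdriver.Remote(
            command_executor="http://localhost:4444/wd/hub",  # path may vary by image version
            options=webdriver.ChromeOptions(),
        )
        driver.get("https://example.com/")
        print(driver.title)
        driver.quit()
    finally:
        container.remove(force=True)  # kill and delete the container when done

In practice each job-queue worker would own one container for the duration of its job, then tear it down.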

That aside, hitting Insta like this is playing with fire, because you're really dealing with Facebook and their legal team.


Serious question: What do you gain from having an extra layer like docker?


Well it does make it extra easy to deploy a scrape node to any type of machine you might encounter (and having a diverse set of source IPs is extra important for scraping; that means you might need to deploy to AWS, Azure, Google Cloud, rackspace, digitalocean, random vps provider X and so on). So instead of having to have custom provisioning profiles for every hosting provider/image combination, you just need to get docker running on a host and you're good to go.


Because you can use pre-packaged Selenium in Docker images with a few commands: https://github.com/SeleniumHQ/docker-selenium


Selenium grid runs in docker, so it's easy to have multiple instances running. Better control.


Also, if use Kubernetes to manage the grid you can scale out to your credit card limit on GKE: https://github.com/kubernetes/kubernetes/tree/master/example...


What are the advantages of this versus a thread pool of web drivers? I'm not really familiar with Selenium Grid.


Grid can dynamically dispatch based on the browser and capabilities you want when you create the session.


True that, I hope Zuck sues me so I'll get extra famous


> AngelList even detects PhamtomJS (have not seen other sites do this).

I run a site that aggregates/crawls job boards for remote job postings, and AngelList has been VERY difficult to crawl for various reasons, but you can easily get PhantomJS to work (I have). Having said that, I've never felt very good about the fact that I'm defeating their attempts to block me (even though I feel like I'm doing them a favor) and will likely retire that bot soon.

It kinda sucks that I'm just grabbing publicly-available content in a very low-bandwidth way, but I really can't convince myself that what I'm doing is very ethical.

My to-do list includes making my crawler into a more well-behaved bot and that will have to go.


I think you may want to decouple your ethical analysis from which private company is making the most money. Remember that the only functional difference between you and somewhere like kayak.com or padmapper is business relationships.


I think PhantomJS has a bit of a giveaway in the headers, where two lines are reversed compared to normal Chrome; I always assumed AngelList detected this flaw. Although I have heard there are builds of PhantomJS where this flaw does not exist.


Mind giving a quick description of getting PhantomJS to work?


I don't know why more people don't use chrome extensions for scraping. Using a boilerplate[1], you can get a scraper up and running in minutes. Start a node server that serves up urls and stores parsed data, and run the scraper in the browser. Best of all, you can watch it running and debug if something goes wrong. I know it doesn't scale well if you're running a SaaS, but for personal projects and research/data normalization it's the lowest barrier to entry, in my opinion.

[1] http://extensionizr.com


Sorry guys, hit by traffic - just scaling my EC2 at the moment.


No worries, we had your page scraped just in case ;)

Google Cache link: http://webcache.googleusercontent.com/search?q=cache:https:/...

Archive.is link: http://archive.is/DQccs


haha :)


Is it common for developers in the eCommerce space to use scrapers as a means to aggressively push automated price-match algorithms? I've been asked to do this a number of times, was just curious as to how prevalent it is.


Yes, everybody scrapes the prices of the others.


With stuff like Facebook opengraph (e.g. og:price) and other meta tags meant to help search engines and social networks get this sort of data to display inline, do you think it's inevitable that complex scraping will no longer be needed in a practical sense since everyone will be inadvertently optimizing their markup in a way that you could write a really simple parser to grab the data?
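For what it's worth, that "really simple parser" is already pretty short today. A sketch with requests and BeautifulSoup (the og: property names vary by site):

    import requests
    from bs4 import BeautifulSoup

    def og_properties(url):
        """Grab all og:* meta tags from a page; no site-specific selectors needed."""
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        return {
            tag["property"]: tag.get("content", "")
            for tag in soup.find_all("meta")
            if tag.get("property", "").startswith("og:")
        }

    # e.g. og_properties("https://example.com/product") might yield
    # {"og:title": "...", "og:price:amount": "19.99", ...}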


Given that FB's bot identifies itself, no, eventually some websites will present og: markup only to FB's bot.


Couldn't you just spoof the user agent though? Or is there some other mechanism that you can use to verify the bot is Facebook's?


IP address.
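In a hand-wavy sketch, the check looks like this, assuming you've already obtained the crawler's published IP ranges from the operator (the ranges below are placeholders, not Facebook's real ones):

    import ipaddress

    # Placeholder ranges; substitute the operator's published crawler networks.
    CRAWLER_NETWORKS = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "2001:db8::/32")]

    def is_official_crawler(remote_ip):
        """True only if the request came from a published crawler range,
        regardless of what the User-Agent header claims."""
        addr = ipaddress.ip_address(remote_ip)
        return any(addr in net for net in CRAWLER_NETWORKS)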


Most companies will use resources not in their datacenter and not identifiable. Executives know it's sketchy, but they do it anyway.


I'd just like to know, how much traffic did you get?


ok on m4.4xlarge now :)


Good stuff.

I do a good bit of scraping, and made RubyRetriever[1] to make my life easier but it seems like I'm getting roadblocked on occasion, probably due to some of the things you mention in your article.

Is there any way for a site to verify that only their JS and CSS files are linked? Like preventing injection?

[1]: https://github.com/joenorton/rubyretriever


You could inspect the src attributes of script tags, and the href attributes of link tags with rel="stylesheet", for acceptable domains. I doubt it would cover all cases, but it might be a start.
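Something like this, for instance (a sketch assuming requests and BeautifulSoup, with a hypothetical allowlist):

    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    ALLOWED_DOMAINS = {"example.com", "cdn.example.com"}  # hypothetical allowlist

    def foreign_assets(page_url):
        """Return script/stylesheet URLs that point outside the allowed domains."""
        soup = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
        urls = [tag.get("src") for tag in soup.find_all("script")]
        urls += [tag.get("href") for tag in soup.find_all("link", rel="stylesheet")]
        suspicious = []
        for url in urls:
            if not url:
                continue  # inline scripts/styles have no src/href
            host = urlparse(url).hostname
            if host and host not in ALLOWED_DOMAINS:
                suspicious.append(url)
        return suspicious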


I got the 100th star on the repo! What do you mean by the verifying part?


At Feedity (https://feedity.com), we "index" webpages to generate custom feeds. Over the years, we've designed our system to use a mix of technologies like .NET (C#) and node.js, and implemented a bunch of tweaks and optimizations for seamless & scalable access to public content.


Any tips and tricks you are able to share about the technologies you guys developed? It would be especially interesting to see what you use for text extraction from HTML.


> But if you are automating your exact actions that happen via a browser, can this be blocked?

Yes, by checking times between actions and number of actions in a time period, and blocking atypical activity. I was IP banned from a site once for a few months, after trying to scrape it too much and hitting links on the site that were hidden from humans.

The random wait settings specified in the post are better than nothing, but still too flimsy. You would need to put hours between requests, only request during certain 15-hour windows, take days off, and eventually you aren't scraping regularly enough to do much good.
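To make that concrete, pacing that even vaguely resembles a human looks something like this (a sketch with made-up numbers; none of it guarantees you won't be flagged):

    import random
    import time
    from datetime import datetime

    ACTIVE_HOURS = range(8, 23)   # illustrative "plausible human" window
    DAY_OFF_PROBABILITY = 0.15    # occasionally skip a whole day

    def should_run_today():
        # Per-day deterministic coin flip, on its own RNG so it doesn't
        # disturb the jitter below.
        return random.Random(datetime.now().strftime("%Y%m%d")).random() > DAY_OFF_PROBABILITY

    def fetch(url):
        print("would fetch", url)  # stand-in for the real request

    while True:
        now = datetime.now()
        if should_run_today() and now.hour in ACTIVE_HOURS:
            fetch("https://example.com/whatever")
        # Long, jittered gaps: anywhere from ten minutes to a couple of hours.
        time.sleep(random.uniform(600, 7200))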

Scraping is not an API, and I should know - I used to do it for a living. It's unreliable. It requires constant maintenance. APIs can break too, but they are meant for the sort of consumption you are trying for.

If you scrape for a living, only do it as a side job.


It really depends on the data you are scraping. My main business relies on scraping and my data mining application has been running for over 5 years. If you have enough IP addresses available to you, it becomes almost impossible to distinguish it from normal users hitting the site...and bandwidth has gotten so cheap, the overhead is very affordable.

I've noticed that most sites actually don't change that often. I deal with changes once or twice every 3 months.

"If you scrape for a living, only do it as a side job."

This is true if you are scraping the low hanging fruit. I scrape 40+ sources (I do have access to a few APIs as well) and then have to extract the patterns/data I need to then integrate it into my business model. This is all automatic now and I only work on upgrading for speed and efficiency.

If you have to scan millions of urls daily from 1 site, it's probably not going to work out. You need to figure out clever ways of getting the data and using it without breaking any laws or pissing off the site owner.


Not scraping, but banks don't even do this for their security, which I found surprising. I just finished building a Chrome extension (https://chrome.google.com/webstore/detail/uyp-free-blasts-th...) that auto-logs in to pretty much any bank or financial web site without having to type anything. The key difference from other password managers is it can auto-fill pretty much anything.

I guess it's part password manager (it stores passwords encrypted in browser storage, not remotely) and part automation wizard :)


I actually love Selenium for this purpose, for much the same reasons the author mentions here.

It's almost impossible for a website to reliably detect that a client web browser is being automated, and I find I can make Selenium scripts much more adaptable to breaking changes in websites when they occur than I can when hooking up my code directly.

I actually disagree with the contention that Selenium is slower than directly scraping though. The Firefox driver has always been lightning fast for me and the bottleneck is almost always server requests that would have been necessary either way.


Whilst they mean well, I find this fundamentally deceptive: the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these "scrape the web with X" articles.

The difficulties are invariably in "post-processing"; working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, the target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting on it, the target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.

Et cetera.

In other words, extracting the right text from the page is the easiest and most trivial part by far, with little practical difference between an admittedly cute jQuery-esque parsing library and just using a blunt regular expression.

It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.


> the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these "scrape the web with X" articles.

I can agree with this after having written a scraper as part of core business functionality (we paid a company for access, but access was just to bare HTML blobs and CSV, not an actual API).

However, to what degree you want to do all this is negotiable, whereas the 'core' of screen-scraping is not: all scrapers have to first figure out how to get text, parse it, then stick it back into their system.

An example of what I mean when I say 'negotiable' is....

> working around incomplete data on the page

Deciding how to do this depends on your problem domain. Sometimes we'd get bad computed data from our source but not care, because it just meant putting more work into calculating it from a more raw source.

> not hitting your target site too often

If they publish how often you are allowed to scrape, this isn't too difficult. If not, then trial and error is the only solution. On occasion, a site simply just doesn't know/care. For example, in my case, the site was static content behind a CDN, so that if we were anywhere under 200 req/second then no flags would ever be raised.

For most smaller sites, that you are unofficially scraping, you may be limited to 1 request every 2 seconds.


What bothers me the most is that recently I wanted to extract an archive of all the threads I participated in on an Internet forum. The webmaster told me that the BBS he uses doesn't provide such a function and that I just had to download each thread manually... (300+ threads in my case).

He then said that it doesn't bother him if I scrape these threads. So I'm currently figuring out how to manage his site's cookie-protected search feature, so that my painstaking effort (I'm not a dev, more a DB guy) can be reproduced more easily by other users of this service.

But this shouldn't happen in the first place, because all posts on this service are stored in a cleanly organized MySQL DB. Yet as no export method is provided, the only way to get structured data back is by scraping (the webmaster told me that no, he won't run custom SQL because he "doesn't want to mess up his DB").

So even though all the data is publicly available through the forum, only a geek can download a personal archive... or Google, because Google scrapes and stores everything.


It's overkill for most things, but I have found that on occasion the best way to scrape stuff behind annoying frontends is with Selenium. pysaunter is a useful library that's one layer of abstraction higher, if you're familiar with Python.


Well I see now that I'm really late to the party with that comment.


As someone who does a lot of scraping, I was happy to learn about Antigate :)


Just joking as I don't scrape unless scraping is allowed. :)


It's trivial to scrape public Instagram URLs...

https://github.com/kingkool68/zadieheimlich/blob/master/func...


Does anybody know what the author means by "lead" (noun)?

I don't think it's any of the regular meanings: http://www.ldoceonline.com/search/?q=Lead

But it doesn't seem to be any of these slang terms either: http://www.urbandictionary.com/define.php?term=lead



Have you run into any issues from running all of your scrapers off of AWS, or just from sites detecting that you're accessing large numbers of pages in some sort of obvious pattern? I guess I was hoping there would be sites with more interesting ways to screw with web scrapers (rearranging certain page elements or something) than just throwing up a CAPTCHA.


Most really don't. A lot of big sites don't seem to care, at least in my experience.

The few that I've seen just 'ban' your IP for a few minutes. If you hit Wikipedia too much too quickly, they will essentially refuse to serve you for a while. It was a number of years ago that I was doing it, but basically you would be scraping, then you would just stop getting info. (Maybe I wasn't reading response codes and could've realized quicker what was happening.)


Wikipedia provides you with an API and guidelines on how to use it, so you really shouldn't be scraping it directly or so much you hit enforced limits.


Wikipedia provides archives of all its content.

No need to scrape it when you can readily download a nicely formatted .xml.zip file containing all knowledge written by mankind.


"It was a number of years ago..."


I'm not actually doing a lot of hits, so it's generally been ok. I can just rotate my IP or solve the CAPTCHA.


A surprisingly small number of sites care. There are some really fun things one can do with random class/id/order variations. It's also fun to feed garbage data to scrapers when you can identify them with very high probability.

But there seems to be little demand for these kinds of systems and just throttling/blocking/CAPTCHA solutions are much simpler.


There are definitely some sites that block entire ip blocks (ex: all of aws). The only real way around this is to use proxies, but if a site's trying to block you, it's probably best to comply, and just stop.


> But if you are automating your exact actions that happen via a browser, can this be blocked?

Of course it can! You won't be able to defeat even the simplest anti-scraping attempt based on statistical data. Even just keeping a list of individual rate limits for the /16 subnets of actual visiting users will put you in trouble.
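For illustration, the bookkeeping on the server side really is that simple. A sketch of a sliding-window counter keyed on the /16 (thresholds made up, IPv4 only):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_SUBNET = 300   # made-up budget per /16 per minute

    hits = defaultdict(deque)       # "203.0" style /16 prefix -> recent timestamps

    def over_limit(remote_ip):
        """Record a hit for the client's /16 and report whether it is over budget."""
        subnet = ".".join(remote_ip.split(".")[:2])   # first two octets == the /16
        now = time.time()
        window = hits[subnet]
        window.append(now)
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()
        return len(window) > MAX_REQUESTS_PER_SUBNET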


Does cheerio account for single page apps? In any case thanks for the tutorial!

Anyways, I added your stuff here along with other data mining resources:

https://github.com/kevindeasis/awesome-fullstack#web-scrapin...


To fight scrapers, we show some values as images that look like text (but not all the time)

And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames. This fucks with xpaths and css selectors.

You can't stop them, but you can make their lives painful.


> To fight scrapers, we show some values as images that look like text

You are fighting screen readers more than anything; as well as legitimate plugins, form autofills, etc. If this is for captcha, you are fighting all the users as well.

> And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames.

Legitimate browser plugins, etc. I'd just use electron or selenium with `nth-child`, `:visible`, `[class*="…"]`, etc.

What you're effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.


> What you're effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.

Can you be so sure? The Union blockade of the Confederacy had plenty of holes, and smugglers / privateers / blockade-runners made good money getting through (when they survived) ... but that doesn't mean the blockade wasn't effective all the same at weakening the Confederate military and economy.


Do you really not see the difference between military blockade and randomizing CSS classes?


Honestly, no. This one time I was pissed off at Egypt for undercutting me in cotton prices, so I tried to set up a blockade to prevent merchant ships going in and out of Cairo.

... and it would have worked, too, except for my naval vessels were all CSS classes. I even tried to name them cleverly, a la "USS hero unit" or "USS datatable table-consensed span9", but my plan was foiled.


Only one guy has to beat it for it to be widely disseminated, though.


Except traffic from known scrapers (or what appear to be) is down 20%

Sure, xpath and css selector experts can figure it out, but that's not everyone


I don't understand, why only 20%? If the traffic is from known scrapers, why can't you just render "scrap off", i.e. easily get rid of them?

And traffic from good scrapers is of course pretty much impossible to measure, so you don't know how big a percentage of scrapers you got rid of in total.


If the scraper gets back nothing, they know they've been spotted and will make adjustments. Easy to check for automatically. If you alter the page to feed them garbage, it takes longer to notice.


It's easy for someone viewing some logs to say "ok this is very likely automated scraping", though it can be harder to automate detecting this. In the same way that porn is obvious to a human, but not a computer.


But it is not magic, and surely it can be automated.

In fact, I would argue that your time might be better spent on this, instead of randomizing CSS classes. If you end up building something worthwhile, that could be a great product too! (Look at all those CDN / anti-DDoS platforms; it sounds like they could've been started this exact way.)


This also hurts accessibility for disabled users.


Not if the img alt text is the same


If the alt text contains the data then all you've done is make life slightly inconvenient for scrapers.


It would be better to invest that time in making an API so they don't need to scrape.


Haven't come across those yet but yes I guess it could be painful.


Hooray Melbourne! Would be interested seeing this at a meetup group if you were thinking of presenting.


Another 3000'nder. Would be great to see this turned into a talk somewhere.


For sure. Trying to think which ones. Probably the MelbJS one and maybe dddmelb? You could modify it to talk at the OWASP one perhaps. Which ones have you been to?


While everyone is busy debating whether scraping is bad or legal, I just can't stop thinking about Antigate.

About the sweatshops that must have been set up to deliver this service. That, to me, is the true horror of this story.


I wonder how effective the CloudFlare anti-scraper protection is against this approach of breaking CAPTCHAs.

Also, I find it interesting that big websites don't just block all traffic from AWS IPs as they do with Tor.


There can be legitimate traffic coming from AWS, if not the site itself.

It's especially true when the site provides an API and is meant to be integrated by people/companies. In which case, the AWS traffic is likely to include major and/or important and/or paying customers. You really don't want to block that.

On the other hand, Tor is likely to be 90% evil. When in doubt, just block it. (That makes me think, I should run some proper stats and maybe publish a blog post about that.)


> There can be legitimate traffic coming from AWS, if not the site itself.

The traffic from the site itself, if it's hosted there, would come from the intranet IP address, right? Not the public facing one.

> It's especially true when the site provides an API and is meant to be integrated by people/companies. In which case, the AWS traffic is likely to include major and/or important and/or paying customers. You really don't want to block that.

Agreed, but it's fairly easy to block the AWS IP traffic on web endpoints and not on the API endpoints.


I think I might have some trouble with some reCAPTCHA stuff, but there must be ways around it. I agree with you on your point about AWS.


There are a fair number of people in China etc running personal VPNs on AWS.


And from the trenches:

- rails application

- scraping with nokogiri gem on Ruby

- simple models doing the scraping in rails app

- some scraping is parsed with CSS selectors - nokogiri

- some scraping is parsed with regex - nokogiri

- persisting to DB, Text, even Google docs

- presentation on web, text, pdf, xls

Boom


How do you push a button, like hitting 'next' on a paginated page?


Right click on the 'next' button in chrome and use 'inspect element' to find its id/class/css selector and then:

browser.click('#Next'); // webdriverio: click the element matching the CSS selector


There is so much that's missing from this. What about gathering tokens from customers vs. paying for social data feeds? How about canned services like 80legs?


Hmm, yeah, there are a lot of other things that I could write about. 80legs seems like another Scrapy-type SaaS? Not sure what you mean about gathering tokens from customers.


I've heard of companies that scrape on behalf of customers who will walk marketing people through the process of creating an API token to help mitigate rate limiting.


Currently getting 502 Gateway. Guessing this post is also trending on reddit and we hugged it to death :(.


Just upgraded my EC2 :)


I'm on Reddit?


Google Analytics > Acquisition > Source/Medium > type "reddit" in search bar. Add secondary dimension "referral path"


It's a total guess. HN rarely hugs sites to death compared to Reddit (IMO).


When I was building liisted.com I scraped using Selenium and it worked great.



