Preventing Site Scraping

mikeash · on May 17, 2012

How to prevent site scraping:

1) Don't have any data worth scraping.

2) Charge for access.

3) Provide APIs so people don't need to scrape your site.

Trying to essentially DRM your web site so that it's human-readable and not machine-readable is not only inherently impossible to do effectively (like any DRM), but is also solving the wrong problem.

Smudge · on May 17, 2012

1 and 3 are the most effective. Option 2 is useful until there is any real demand for your data, and then you're back to trying to prevent automated scraping by your paying customers.

jdale27 · on May 18, 2012

If you are charging someone for access, then presumably you are requiring them to log in to access the data. In that case, it should be pretty straightforward to throttle or completely block them if they are doing something (e.g., scraping) that violates your terms of service. Then you charge them again to use your API.

jack-r-abbit · on May 17, 2012

#3 is only an option if you are trying to prevent scraping just to reduce your bandwidth caused by rapid, repeated page requests. If you are trying to prevent someone one from just coming in and scooping up all your data, then providing an API is worse than just allowing the scraper to scrape.

There is no bullet proof way to stop it. So you make it as painful for the scraper as possible. I like the randomized classes/ids and the extraneous random invisible table cells and divs.

mikeash · on May 17, 2012

My point is that trying to provide data that can be used in one way but not another way is a pretty ridiculous concept. If you have a lot of data, you should have an API.

The idea that you can have a ton of data accessible on the web but somehow only let humans access it, and not computers, is quite simply untenable. My point with #3 (and to a lesser extent #1) is that it makes little sense to solve this problem at all.

jack-r-abbit · on May 17, 2012

> a pretty ridiculous concept

That is an opinion. One that I don't agree with. Companies spend a lot of time and money collecting/refining/cleansing/etc the data so they can present it to their users. It is not ridiculous for them to want to keep some script from capturing it all in one swoop. A lot of sites have clauses in the TOS that prohibit mechanized data harvesting. But there are a lot people that place zero value on the work others do to create something so they wait for someone else to do the work and then just come in and share it. If you sit at your school desk and wait for the girl next to you to finish her exam and then you lean over and share her answers... is that not cheating? Sure she doesn't own the data she just wrote down. And most likely it was all in the course book anyway. And of course you aren't stealing her answers since you don't effect her ability to turn in her exam. But there is the total lack of respect for the time and effort that other people put into doing the work. This everything is free and share ALL the things attitude is not sustainable.

mikeash · on May 17, 2012

Um, wow. I was actually just discussing technical feasibility. Nowhere did I state, or imply, or do I believe, that we shouldn't respect the time people put into work.

I'd explain what I meant further, but by jumping into a full-blown rant on me on piracy after I expressed a technically minded opinion about the futility of DRM on a web page, I think you've clearly demonstrated that you're not worth engaging further.

jack-r-abbit · on May 17, 2012

sorry. I misunderstood your comment. my bad. a bit punchy today. we cool?

Edit: Yes. Technically it does seem futile. It does become a game of cat and mouse (not unlike the spam/anti-spam battle over the years). Slow them down. Make them work for it. I used to work with a crazy old engineer that always told me: "I don't lock my car. If a thief wants my radio... he'll get it regardless. I'd rather deal with just a missing radio instead of a missing radio AND a broken window." Did I mention he was crazy? lol But I thought there is probably a large subset of people that would steal a radio out of an unlocked car but would not break a window to get in. Who knows.

mikeash · on May 18, 2012

Yes, we cool.

I can understand the "slow it down" idea, but it just doesn't seem worthwhile. If your data is that valuable, charge for access. Then it doesn't matter if they download 100 pages a second, since they're then paying for 100 pages a second.

Smudge · on May 17, 2012

None of this is false, exactly.

But if you put data on the internet for people to view in their browsers, you should not be surprised when people consume the data and use it for their own purposes. Site scrapers fall in this category.

As mikeash said, trying to obfuscate your site (because it can be consumed by more than just a person's eyes) is solving the wrong problem. You can't have (own) your cake (data) and eat it (render it publicly) too.

lusr · on May 18, 2012

Your analogy is flawed. When you sit for an exam you agree not to cheat (whether explicitly, as a student who has signed acceptance of an offer to study subject to the institution's regulations, or owing to the law of the land).

I don't think anybody is arguing everything should be free and uncopyrightable. People are simply observing that preventing information flow is difficult, and - whether you like it or not - there are plenty of people out there who don't share your sense of ethics. Furthermore, much data is uncopyrightable (facts) and it's only the presentation and organisation of the data that is copyrightable. If your business can be destroyed because somebody copied your data I'm sorry to say but you didn't actually have a business to begin with.

rhizome · on May 17, 2012

In other words, "you can't puts ads on an API."

jack-r-abbit · on May 17, 2012

That is true. I'm sure there are sites that don't provide an API for that reason. Ad revenue on the site might be what allows them to provide the data for free. But that wasn't what I was talking about.

vasco · on May 17, 2012

Not one of the methods listed here would detract a decent scraper. Moreover you would either screw with your users or with SEO if you wanted to make this or that technique more aggressive.

If your database has really great content it's not because some kid has copy of your website online that you'll lose users. Stackoverflow has been scraped to death and nobody goes to the other sites to check out answers.

josephcooney · on May 17, 2012

I don't think there is even any need to scrape stackoverflow. They make all their data available for free.

pinchyfingers · on May 18, 2012

Right, but other sites scrape that free info and republish it with the goal of drawing traffic. Apparently, Stack Overflow has an API. I don't know the functionality of their API, but it doesn't matter: via API or web scraping, the content of the SO database is all over the web, yet SO is thriving.

josephcooney · on May 19, 2012

You said "other sites scrape that free". I agree that info is all over the web but it isn't scraped, and I don't think you should keep calling it scraping.

inDesperateZone · on May 18, 2012

Yet there seem to be people who use their data to make a bit of money on the side. I haven't had time to do so, but I wanted to ask SO if the translated SO pages on humbug.in/stackoverflow are against any rules. He punches their pages trough googles translate and stuffs them full with banners.

Nearly drove me nuts because a friend who I was working on a project with kept searching in german. As a result he got the terribly translated pages first followed by the actual SO pages.

jumpbug · on May 17, 2012

They wouldn't indeed, but, it makes it more difficult in the hopes that they scrape a different data source.

vasco · on May 17, 2012

Do you really think a bunch of scrapers' time is more valuable then yours? I'm assuming that you have the ability to create this great piece of content that people want to have so I think your time would be better spent either creating more of said content or pursuing other endeavors other then making some kids waste an additional 15 minutes figuring out a way to solve an interesting puzzle.

ceejayoz · on May 17, 2012

If they can easily find another source to scrape the same data, said data probably isn't worth spending this much time and effort protecting anyways.

level09 · on May 18, 2012

people will not go there, however the other website will get accidental traffic which might be 2% of stackoverflow's traffic, this is a little bit of money for doing nothing but running an automated script.

ChuckMcM · on May 17, 2012

Try running a search engine. :-) Needless to say we get folks all the time who are trying to create or enhance data bases out of our index all the time. We even have an error page that suggests they contact business development in the unlikely event they don't "get" the fact that our index is part of our economic value.

One of the humorous things we found is that scrapers can eat error pages very very quickly. Some of our first scrapers were scripts that looked for a page, then the next page, then the next page. We set up nginx so that it could return an error really cheaply and quickly, and once an IP crossed the threshold, blam! we start sending them the error page. What happened next was something over 20,000 hits per second from the IP as the page processing loop became effectively a no-op in their code.

We thought about sending them SERPs to things like the FBI or Interpol or something so they would go charge off in that direction, but its not our way. We settled on telling our router to dump them in the bit bucket.

pygy_ · on May 17, 2012

Ajaxification can be defeated if you scrape using a headless browser like PhantomJS [+]. Actually all the markup/visual techniques you propose can also be defeated using Phantom. Dump the page as PNG and OCR it.

Honey pots suppose that the scrapper is an idiot... And even in that case, if he's dedicated, he'll come back later and will be more careful.

The only potentially effective solutions are those that preclude usability for everyone: truncating the content for logged out users. And even then, with PhantomJS and some subtlety/patience in order not to trigger flood detection, an attacker could probably get away with it.

[+] http://phantomjs.org/

ceejayoz · on May 17, 2012

> By loading the paginated data through javascript without a page reload, this significantly complicates the job for a lot of scrapers out there. Google only recently itself started parsing javascript on page. There is little disadvantage to reloading the data like this.

Well, unless you're visually impaired and using a screen reader... and it doesn't really complicate things for any halfway dedicated scraper, as your AJAX pagination requests probably follow the same predictable pattern as the non-AJAX ones would've.

jacobr · on May 17, 2012

Do you have any real-world examples of commonly used screen readers that can't handle JavaScript? Screen readers gets its content from the DOM in a browser, so if the browser is able to put it in the DOM, it should be available to the screen reader.

98.4% of screen reader users in a 2010 survey (http://webaim.org/projects/screenreadersurvey3/#javascript) had JavaScript enabled.

TeMPOraL · on May 17, 2012

And if a screen reader can see the data, so can the scrapper.

joshu · on May 17, 2012

AJAXification could make it much easier. GET -> json import.

paulgb · on May 17, 2012

Exactly what I was thinking. Heck, I've scraped using fake AJAX requests even when a legitimate API was available, simply because opening up firebug was faster than reading the API docs.

jamesaguilar · on May 17, 2012

This is exactly what I used for my project to scrape Path of Exile item data.

joshu · on May 17, 2012

Ha! I did the same for KoL once upon a time.

jumpbug · on May 17, 2012

That is why I suggest using as many of these in combination as possible. A lot of your basic scraping scripts won't have this functionality built in.

epoxyhockey · on May 17, 2012

I love the first comment on that post: It will be more of a pain than usual, but I will get the data. I always get the data.

- As mentioned: AJAXification the data makes it easier to grab.

- Convert text to images? I'll OCR it. http://code.google.com/p/tesseract-ocr/wiki/ReadMe

- Honeypot a random link? I don't scrape every link on the page, only links that have my data.

- Randomize the output? And drive your real users crazy?

I have found that the best deterrent to drive-by scraping is to not put CSS id's on everything. Apart from that, you'll need to put the data behind a pay wall.

cletus · on May 17, 2012

A lot of the people commenting on these techniques being fallible are missing the point: the idea isn't to make scraping impossible (despite the misleading title), it's to make it hard(er).

A determined scraper will defeat these techniques but most scrapers aren't determined, sufficiently skilled or so inclined to spend the time.

I've been curious about a variation of the honeypot scheme using something like Varnish. If you catch a scraper with a honeypot, how easy would it be to give them a version of your site that is cached and doesn't update very often?

ChuckMcM · on May 17, 2012

C'mon cletus give us a bit of credit. Are you telling us that a company with the World's Smartest Engineers(tm) doesn't already do exactly this with their custom front end machines? :-) It's one of the more entertaining new hire classes.

You are correct that perfection is not achievable and you don't even want to get so close that you get very many false positives. But honey pots are bandwidth, which for folks who pay for bandwidth as part of their infrastructure charge, its a burden they are loath to bear. Rather to simply toss the packets into the ether whence they came rather than bother waking up their EC-2 instance.

robryan · on May 18, 2012

They sure love to use GWT with an indecipherable exchange format though, have tried to scrape a few things in adwords before. I'm sure it is possible but there was enough of a deterrent for me to not bother.

barbazfoo12 · on May 17, 2012

4. Provide a compressed archive of the data the scrapers want and make it available.

No one should have to scrape in the first place.

It's not 1993 anymore. Sites want Google and others to have their data. Turns out that allowing scraping produced something everyone agrees is valuable: a decent search engine. Sites are being designed to be scraped by a search engine bot. This is silly when you think about it. Just give them the data already.

There is too much unnecessary scraping going on. We could save a whole lot of energy by moving more toward a data dump standard.

Plenty of examples to follow. Wikimedia, StackExchange, Public Resource, Amazon's AWS suggestions for free data sources, etc.

FuzzyDunlop · on May 17, 2012

One might argue that indexing from a data-dump will lead to search results that are only as up to date as the last dump.

In StackExchange's case, most of these are now a week or more old.

Maybe it's a good idea, but I'm not sure how many would want to dump their data on a daily basis to keep Google updated, when Google can quite easily crawl their sites as and when it needs to.

barbazfoo12 · on May 17, 2012

Have you considered rysnc? Dropbox uses it. So lots of people who don't even know what rsync is are now using it. We could all be using it for much more than just Dropbox. And if you have ever used gzip on html you know how well it compresses. The savings are quite substantial. Do you think most browsers are normally requesting compressed html?

minikomi · on May 17, 2012

It could be /data.zip like /robots.txt

Cyndre · on May 17, 2012

Here is my approach on how to find scrappers.

They are already supplying fake data to see if they are being scrapped.

Using this fake data they can find all the sites that are using their scrapped data. Congrats we now know who is scrapping you with a simple google search.

Now comes the fun part. Instead of supplying the same fake data to all, we need to supply unique fake data to every ip address that comes to the site. Keep track of what ip, and what data you gave them.

Build your own scrapper's specifically for the sites that are stealing your content and scrape them looking for your unique fake data.

Once you find the unique fake data, tie it back to the ip address we stored earlier and you have your scrapper.

This can be all automated at this point to auto ban the crawler that keeps stealing your data. But that wouldn't be fun and would be very obvious. Instead what we will do is randomize the data in some way so its completely useless etc.

Sit back and enjoy

showerst · on May 17, 2012

In general I think that getting into an arms race with scrapers is not something that you will win, but if you have a dedicated account for each user you can at least take some action.

If this data is actually valuable, they should put it behind some sort of registration. Then they can swap out the planted data for each user to something that links back to the unique account, without wrecking things for users with accessibility needs or unusual setups.

goodside · on May 18, 2012

I have yet to see any anti-scraping method that can protect against a full instance of Chrome automated with Sikuli. It's obviously very expensive to run, since you either need dedicated boxes or VMs, but it always works. In my experience the most consistent parts of any web application are the text and widgets that ultimately render on the screen, so you easily make up for the runtime costs with reduced maintenance. You could in theory make a site that randomly changes button labels or positions, but to the extent you annoy scrapers you're also going to annoy your actual users.

dustywusty · on May 17, 2012

As pointed out by others, many of the suggestions here break core fundamentals of the web, and are generally horrible ideas. It's unsurprising to see suggestions in the comments such as, "add a CAPTCHA", which is nearly as bad of an idea. If you're willing to write bad code and damage user experience to prevent people from retrieving publicly accessible data, perhaps you should rethink your operation a bit.

basseq · on May 17, 2012

Generally speaking, if you're in the business of collecting data, but you have a competitive incentive not to share and disseminate that data as broadly as possible, you're in the wrong business. This article seems to address a problem of business model more than anything else. And if you're using technology to solve a problem in your business model...

chrisacky · on May 17, 2012

Let me start by saying that I am a sadochistic scraper (yeah I just made up that word) but I will get your database if I want it. This goes the same for other scrapers who I am sure are more persistent than even I am.

You don't have to read any futher, but you should realise that...

* People will get your data if they want it *

The only way you can try and prevent it, is to have a [1] whitelist of scrapers and blacklist useragents who are hitting you faster than you deem possible. You should also paywall if the information is that valuable to you. Or work on your business model so that you can work on providing it free.... so that reuse doesn't effect you.

---------------------------------

I thought I would provide an account of the three reasons why I scrape data.

There are lots of different types of data that I scrape for and it falls into a few different categories. I'll keep it all vague so I can explain in as much detail as possible.

[1] User information (to generate leads for my own services)...

This can be useful for a few reasons. But often it's to find people who might find my service useful.... So many sites reveal their users information. Don't do this unless you have good reason to do so. If I'm just looking for contact information of users, I'll run something like httrack and then parse the mirrored site for patterns. (I'm that paranoid that check out how I write my email address in my user profile on this site).

[2] Economically valuable data that I can resuppose....

A lot of the data that I scrape I won't use directly on sites. I'm not going to cross legal boundaries.. and I certainly don't want to be slapped with a copyright notice (I might scrape content, but I'm not going to so willfully break the law). But, for example, there is a certain very popular website that collects business information and displays it on their network of websites. They also display this information in Google Maps as Markers. One of my most successful scrapes of all time, was to pretend to be a user and constantly request different locations to their "private API". It took over a month to stay under the radar, but I got the data. I got banned regularly, but would just spawn up a new server with a new IP. I'm not going to use this data anywhere on my sites. It's their database that they have built up. But, I can use this data to make my service better to my users.

[3] Content...

Back in day... I used to just scrap content. I don't do this any more since I'm actually working on what will hopefully be a very succesul startup... however, I used to scrape articles/content written by people. I created my own content management system that would publish entire websites for specific terms. This used to work fantastically when the search engines weren't that smart. I would guess it would fail awfully now. But I would quite easily be able to generate a few hundred uniques per website. (This would be considerable when multiplied out to lots of websites!!!).

Anyway, content would be useful to me, because I would spin in into new content, using a very basic markov chain. I'd have thousands of websites up and running all on different .info domains, (bought for 88cents each) and running advertisements on them. The domains would eventually get banned from Google and you'd throw the domain away. You'd make more than 88 cents through affiliate systems and commission junction and the likes that this didn't matter, and you were doing it on such a large scale that it would be quite prosperous.

------------------------------------

I honestly couldn't really offer you any advice on how to prevent scraping. The best you can do is slow us down.

And the best way to do that is the figure out who is hitting your pages in such a methodical manner and rate limiting them. If you are smart enough, you might also try to "hellban" us, by serving up totally false data. I really would have laughed, if the time I scraped 5million longitude and latitudes over a period of a few months, if at the end of the process, I noticed that all of the lats were wrong.

Resistance is futile. You will be assimilated. </geek>

_3u10 · on May 18, 2012

Yeah as a scraper I'd say that at most all these suggestions would do is make me turn to selenium/greasemonkey instead of mechanize/wget/httrack. Selenium is the bomb when people try to get fancy preventing scraping, how exactly are they supposed to detect the difference between a browser and a browser?

Getting banned is not a big deal, plenty of IPs & proxies out there. EC2 is your best friend as you can automate the IP recycling. Even Facebook/Twitter accounts are almost free.

Even the randomization wouldn't be particularly difficult to circumvent just save the page and then use a genetic algorithm with tunable parameters for the randomization, select the parameters that yield the most/best records.

What I'd actually fear is a system that just silently corrupted the records once scraping was detected, especially if it was intermittent, eg. 10-75% of records on a page are bogus and only every few pages. Or they started displaying the records as images (but I'm guessing they want Google juice)

wpwebsite · on May 18, 2012

I actually came up with a very effective method for identifying scraping and blocking it in near real-time. The challenge that I've had was that I was being scraped via many many proxies/IPs in short spurts using a variety of user agents - so as to avoid, or make difficult detection. The solution was simply to identify bot behavior and block it:

1. Scan the raw access logs via 1 minute cron for the last 10,000 lines - depending on how trafficked your site is

2. parse the data by IP, and then by request time

3. search for IP's that have not requested a universal and necessary elements like anything in the images or scripts folder, and that made repetitive requests in a short period of time - like 1 second.

4. Shell command 'csf -d IP_ADDY scraping' to add to the firewall block list.

This process is so effective of identifying bots/spiders that I've had to create a whitelist for search engines and other monitoring services that I want to continue to have access to the site.

Most scrapers don't go to the extent of scraping via headless browsers - so, for the most part, I've pretty much thwarted the scraping that was prevalent on my site.

Silhouette · on May 18, 2012

I honestly couldn't really offer you any advice on how to prevent scraping. The best you can do is slow us down.

And the best way to do that is the figure out who is hitting your pages in such a methodical manner and rate limiting them. If you are smart enough, you might also try to "hellban" us, by serving up totally false data.

Well, no, there are other ways too.

For example, any site behind a paywall probably has your identity, and unless you live in a faraway place with impotent copyright laws -- and there aren't that many of them any more -- there are often staggeringly disproportionate damages for infringement available through the courts these days, certainly enough to justify retaining legal representation to bring a suit in any major jurisdiction. Given a server log showing a pattern of systematic downloading that could only be done by an automated scraper in violation of a site's ToS, and given a credit card in your name linked to the account and an IP address linked to your residence where the downloads went, I imagine it's going to be a fairly short and extremely expensive lawsuit if you upset the wrong site owner.

lusr · on May 18, 2012

Not all valuable scrapeable data is copyrightable. I also know of a number of sites I've scraped that don't even bother attempting to restrict your access to their data through T&Cs even though its the basis for their site (not that they'd have much of a legal basis for enforcing that, anyway). Ultimately if you're in the business of selling raw data with no value added, the problem is your business model, not scrapers.

Silhouette · on May 18, 2012

Not all valuable scrapeable data is copyrightable.

Sure, but a lot of it is, and even the bits that aren't may be protected by other laws such as database rights depending on your jurisdiction. I think anyone who maintains that you can't stop scrapers as a general principle is possibly a little unwise.

389401a · on May 18, 2012

There's nothing worse than spending lots of hard work scraping sites to build your search engine and then having bad guys perpetrate the scraping of your search engine.

Maybe it's some sort of karma. If you scrape, then you will get scraped.

_delirium · on May 17, 2012

I don't know what kind of site this is, so it's hard to say if it applies, but do note that several of these can significantly harm usability for legitimate users as well. For example, someone might be copy/pasting a segment of data into Excel to do some analysis for a paper, fully intending to credit you as the source; if you insert fake cells, or render the data to an image, you make their life a lot more difficult.

The first suggestion (AJAX-ifying pagination) can be done without a major usability hit if you give the user permalinks with hash fragments, though, so example.com/foo/2 becomes example.com/foo#2.

soulclap · on May 18, 2012

I am currently working on a project that involves some scraping as well. The most annoying things I came across so far are:

- Totally broken markup (I fixed this by either using Tidy first or just using a Regex instead of a 'smart' HTML/XML parser)

- Sites that need Javascript even on 'deep links' (I fixed this by using PhantomJS and saving the HTML instead of just using curl)

- Inconsistency, by far the most annoying: different classes, different formatting, different elements for things that should more or less be identical (basically fixing this whenever I come across a problem but sometimes it's just too much hassle and well, ask yourself if you really need to get every single 'item' from your target)

One more thing: RSS is your friend. And often you can find a suitable RSS link (that's not linked anywhere on the site) by just trying some URLs.

PS: No, I am not doing anything evil. If this project ever goes live/public, I'll hit all the targeted sites up and ask for permission. Not causing any significant traffic either.

kysol · on May 18, 2012

The best techniques to stop scraping, are ones not discussed in public.

garethsprice · on May 18, 2012

Anything that can be displayed on a screen can be scraped.

An approach I used to prevent scraping in the past is to start rate limiting anything that hits over N pageviews in an hour, where N is a value around what a high-use user could manually consume. Start with a small delay and increment it with each pageview (excess hits*100ms), then send HTTP 509 (Exceeded Bandwidth) for anything that is clearly hammering the server (or start returning junk data if you're feeling vengeful).

Added bonus is that the crawler will appear to function correctly during testing until they try to do a full production run and run into the (previously undetectable) rate limiting.

This project did not require search indexing so we didn't care about legit searchbots, but you could exclude known Google/Bing crawlers and log IPs of anything that hits the limit for manual whitelisting (or blacklisting of repeat offenders).

eps · on May 18, 2012

Not a 509, but a Deny rule in the firewall config. Works miracles.

x3sphere · on May 18, 2012

More trouble than it's worth. Plus, none of these solutions actually prevent site scraping... if the person is dedicated enough, they'll find a way. The time spent on implementing any of these approaches would be much better spent on site optimization, features, etc.

rehack · on May 18, 2012

Sorry to say, that these techniques will work for Enterprisey folks who anyway won't have to scrape.

For example you say 'Randomize template output'. Scrapers used a mixture of various techniques. Say, HTML Path does not work (despite supporting HTML wild cards in the form of body/table[1]/tr[*]/. Then you fall back to just some patterns could be the title of your data or anything.

Have scraped content coming in Flash also. Basically how can you stop anybody from understanding either a) data exchanged between the browser and the server Or (in case its encrypted/encoded) b) The HTML once its rendered.

Only way of doing it is by having your own custom browser, and preventing its source code getting leaked.

PS: I scrape, and our clients (whom we scrape) know that.

DigitalSea · on May 18, 2012

I've written a lot of scrapers myself and let me start off by saying there is no such thing as a site that can't be scraped. You can add in honey-pots all day long and at the end of the day once I've discovered my bot has been detected, I'll find a way around it. If the content is worth scraping and the site owner doesn't have an API (free or pay to use), then people will find a way to get the data regardless of what you do.

A well-intentioned article that puts forth a few great ideas for amateurs, but at the end of the day it's wasted time and effort that could have been better spent, oh I don't know, developing an API for your users instead.

If I want data I'll get the data, I always win.

jwdunne · on May 17, 2012

A less aggressive approach I've encountered is to insert links to other pages of your website with full URLs (http://www.example.com/page.html over just /page.html). Usually, a scraper will copy the links too. This should then make it obvious that the content's been scraped.

This could become a nightmare to maintain if you don't automate it. It'd be trivial to automate on a CMS. I know WordPress has loads of plugins for exactly this. I don't think I've come across something that can do this for static websites though, which make up for the brunt of the websites I maintain.

epoxyhockey · on May 17, 2012

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

chc · on May 17, 2012

If your data matters at all to the scraper, this won't present much of an obstacle. Fixing up internal links is really easy. Your average script kiddie could probably figure out how to work around this in a matter of hours. Like so many "tricks", it sounds like more trouble for you than for them.

netvarun · on May 18, 2012

IHMO, preventing site scraping is really really hard. There are couple of startups that offer products that claim to stop site scraping.[1]

1. Ajaxified paginated data ->

With Firebug looking at XHR requests, one can easily reverse engineer the ajax call and extract data directly from the json - in fact you are making the scraper's life easier as you will probably be storing the information in a structured manner in the json string. Or one could use a more sophisticated tool like MITMProxy to study how requests are made.

If you somehow managed to implement a highly obfuscated method of ajax requests (re: microsoft aspx.net), there is always selenium to get through them.

2. Randomize template output ->

You are going to annoy users if you are going to display a different template altogether. If its just randomizing div and class ids, one can write clever xpath expressions or css selectors to circumvent this. Or worse case there are always the ever reliable regex expressions.

3. HoneyPot ->

Scrapers only crawl pages that they are specifically looking for. A good scraper only runs through pages he wants to scrape. Nothing more. This is probably the least effective strategy.

4. Write data to images on the fly ->

Use an OCR api to decode them!

5. Alternatives ->

Putting in a login screen is also not effective as not only will that annoy users, it can easily be circumvented by using selenium or passing the cookie/session information to the scraping script.

Blacklisting ips is not going to a very effective strategy. With tons of free proxies, Tor and cloud based services (especially PiCloud[2] - which offers a scraping optimized instance!!), ip blocking can easily be circumvented.

Best strategy would be display corrupted content or start throwing CAPTCHAs if you sense a large number of requests coming from a particular IP.

But once again you may want to do some sort of machine learning on server logs based on the various ips and the specific urls visited and build a model that could predict if a particular user is a bot or a human before you start throwing fake data or captchas. Just to be on the safe side so that you don't annoy anyone.

[1] http://spider.io and http://blockscraping [2] http://www.picloud.com/pricing/ s1 instance plan

TazeTSchnitzel · on May 18, 2012

Some ideas:

- Use Unicode RTL override, then put in numbers/text backwards (this is also fun for messing with web forums, but that's another story...)

- Inject zero-width characters (such as ZWSP U+200B, WJ U+2060, ZWNJ U+200C, ZWJ U+200D)

- Use intentionally broken HTML in places, to catch out some parsers (browsers won't care, some scraping parsers will)

- Occasionally don't maintain element order in HTML, use CSS tricks to make it appear in the correct places

- Add useless dummy data hidden in places using complex CSS rules

- Require JS-generated auth token using an obscure, obfuscated algorithm to download the data, which expires immediately

simonster · on May 17, 2012

Google Scholar's solution is to show a captcha when a given IP has made too many requests in a given time period, although a scraper can easily throttle to avoid this.

You could require a captcha every n page views, or you could render the text of the page as a distorted image, which would defeat the OCR approach others have suggested here. These would make scraping difficult, but they would also mean throwing UX out the window, and they could still be defeated with Mechanical Turk.

wensheng · on May 18, 2012

Great! I learned a few scraping tricks from this article. In the past, all I did was using "time.sleep(3)" to pace my scrapping so it stayed off the scrapee's radar screen.

akrymski · on May 18, 2012

A friend of mine has been working on anti-scraping for a while. The question is: is there a market for an anti-scraping service? Would you pay so such a service and how much?

Some examples out there: http://blog.cloudflare.com/introducing-scrapeshield-discover...

http://www.blockscript.com/

ericcholis · on May 18, 2012

I've run into a few sites that "prevent" scraping. I just jump on http://developer.yahoo.com/yql/ and scrape away. The surface level defenses (ip blocking) are usually sufficient, but if there is a real developer behind the scraper, they will get to your content.

Too much prevention could be counter-productive, as you may inadvertently deny the friendly spiders.

ajitk · on May 18, 2012

In addition to mentioned suggestion I will add some here. Make difference layers difficult to understand and use (which will affect real user too!):

1. Script delivery. Use dynamically loaded modules.

2. Content delivery. Use websockets to deliver data. Does any developer tool show content going through websockets?

3. Content rendering. Use canvas to render content on the screen. Use fonts that make OCR difficult. Handwriting? :)

4. Use of browser plug-ing like Flash and Java.

5. Content delivered over video.

6. Quota based content delivery for requests originating from the same source. Use of multiple signals like cookies and IP addresses to pinpoint the source. Progressive

However, I would not recommend doing most of these. As mikeash aptly compared it to DRM, usability of the website will be affected negatively.

Like software, we can make it hard to reverse engineer them but cannot prevent a determined person from reversing it. There is no way to prevent a determined scraper. They will scrape your data using appropriate tools like real browsers, OCR or real people (mtrurk?).

Edit: formatting.

paulsutter · on May 17, 2012

Great ideas. Love the honeypot.

One more thought: once you discover a badguy, if you signal this immediately you are just speeding up his learning cycle. Don't just cut him off, tarpit him (gradually slow down responses) or return bad data.

irfan · on May 18, 2012

I agree with the argument * People will get your data if they want it* Reading the discussion here and on original post, I can make a scraper that would be difficult to detect and block ;-)

tlevine · on May 18, 2012

You could just print the data in a book instead of putting them on a website.

Oh woops never mind (http://www.diybookscanner.org/).

weixiyen · on May 17, 2012

Honeypots are pretty worthless. The article counters itself.

blrblr · on May 18, 2012

I can write a scrapper for any text-based website within an hour. As far as I know, site scrapping can't be prevented. You could make it harder, though.

level09 · on May 18, 2012

the only way to prevent scraping is to shut down access to your website, with the new modern libraries like phantomjs and selenium any one can write a scrapper that executes javascript and reads website pretty much like any human user.