
I have some sites where I specifically block archiving of some sections, for good reason. (Even if I didn't have a good reason, though, it would still be my choice.)

I have a very big problem with them disregarding robots directives. Sure, some crawlers already ignore them: hostile net actors up to no good. This decision makes them a hostile net actor too. I'll have to take extreme measures, such as determining all the IP address ranges they use and blocking access to them entirely. That inconveniences me, which means they are now my enemy.

edit- For those interested: Deny from 207.241.224.0/22
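
If anyone wants to do the same, here is a rough .htaccess sketch of where that directive goes. It assumes Apache with the 2.2-style access directives (mod_access_compat on 2.4); on a plain 2.4 setup the Require form below does the same thing.

    # Block the range above from the whole site (sketch)
    Order Allow,Deny
    Allow from all
    Deny from 207.241.224.0/22

    # Apache 2.4 equivalent (mod_authz_core)
    <RequireAll>
        Require all granted
        Require not ip 207.241.224.0/22
    </RequireAll>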




I have an easier solution for you: just shut down your site and be done with it. That way no malicious actor will be able to save your precious information.


Why not just block the ia_archiver user agent in your web server for those paths instead? Also, I'm curious: what could that good reason be?
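
For the blocking part, something along these lines should work, assuming Apache with mod_setenvif and 2.2-style access directives (the directory path is just a placeholder):

    # Sketch: keep the ia_archiver user agent out of one section
    BrowserMatchNoCase "ia_archiver" archive_bot
    <Directory "/var/www/site/private">
        Order Allow,Deny
        Allow from all
        Deny from env=archive_bot
    </Directory>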


Can I ask what the good reason is?


Are you under the impression that individual web archivists don't also scrape websites of interest and submit those WARCs for inclusion in the Wayback Machine, independently of the IA's crawlers?

Because believe me, we do... Good luck banning every AWS and DO IP range.


Thank you for the tip. I wasn't aware of that, but it was not a problem to update the rules to account for the full AWS range based on the new information. I greatly appreciate your feedback. I am not sure what DO is, though; would you be so kind as to deacronymize that for me? Thank you.


We also run crawlers on our home laptops, on university servers, on every cheapo hosting service we can find (especially if they offer decent or "unlimited" bandwidth), and so on. Tools like wget and wpull can randomize the timing between requests, use regex to avoid pitfalls, change the user-agent string, work in tandem with phantomjs and/or youtube-dl to grab embedded video content...
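
To give a rough idea, a single wget invocation can already cover most of that; the URL and the reject pattern below are placeholders, and wpull accepts largely compatible options:

    # Sketch: mirror a site with randomized delays and a browser-like user agent
    wget --mirror --page-requisites --convert-links \
         --wait=2 --random-wait \
         --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
         --reject-regex="/(logout|delete)/" \
         https://example.com/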

Good luck playing whack-a-mole against the crawlers. I admit to being very curious about what you're openly hosting online that you really don't want saved for posterity.


DigitalOcean.


Banning AWS and DO is pretty simple for those who care. If your audience is people rather than automation, you won't get many false positives, though there are some real people browsing through proxies hosted on AWS/DO.
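
AWS at least publishes its ranges in machine-readable form, so the block list can be scripted; a rough sketch, assuming curl and jq are available (DigitalOcean's ranges you would have to source separately):

    # Sketch: turn Amazon's published IP ranges into Apache Deny directives
    # (IPv4 only; ipv6_prefixes is a separate key in the same file)
    curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
      | jq -r '.prefixes[].ip_prefix' \
      | sed 's/^/Deny from /' > aws-deny.conf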


I actually didn't know that. Do you operate the same crawlers?

I have considered putting up a single file that is only reachable via nofollow links and perma-banning any IP that accesses the file, as a way to punish bad robots.
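
If I ever do it, I would probably wire it up with fail2ban rather than hand-rolling it. A rough sketch, assuming Apache's combined access log and a hypothetical /trap.html honeypot path:

    # /etc/fail2ban/filter.d/honeypot.conf (sketch)
    [Definition]
    failregex = ^<HOST> .* "GET /trap\.html

    # jail.local entry: ban on the first hit
    # (a negative bantime makes the ban permanent in recent fail2ban versions)
    [honeypot]
    enabled  = true
    port     = http,https
    filter   = honeypot
    logpath  = /var/log/apache2/access.log
    maxretry = 1
    bantime  = -1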


FWIW, humans happen to be able to choose their user agent at will.

Not so long ago, changing your user agent to that of a search engine bot was a simple workaround for some paywalls on pages that appeared in search results.
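
As a concrete example, something like this (the URL is a placeholder; the string is the user agent Google documents for Googlebot):

    curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
         https://example.com/some-paywalled-article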

It's also one of the techniques used for extra privacy and for messing with fingerprinting. For example, Random Agent Spoofer: https://github.com/dillbyrne/random-agent-spoofer



