
I have some sites where I specifically block archiving of some sections, for good reason. (Even if I didn't have a good reason, though, it would still be my choice.)

I have a very big problem with them disregarding robots directives. Sure, some crawlers already ignore them: hostile net actors up to no good. This decision makes them a hostile net actor too. I'll have to take extreme measures, such as determining all the IP address ranges they use and blocking access to them entirely. That inconveniences me, which means they are now my enemy.

edit- For those interested: Deny from 207.241.224.0/22
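
If anyone wants to do the same, here is a rough .htaccess sketch of where that directive goes. It assumes Apache with the 2.2-style access directives (mod_access_compat on 2.4); on a plain 2.4 setup the Require form below does the same thing.

    # Block the range above from the whole site (sketch)
    Order Allow,Deny
    Allow from all
    Deny from 207.241.224.0/22

    # Apache 2.4 equivalent (mod_authz_core)
    <RequireAll>
        Require all granted
        Require not ip 207.241.224.0/22
    </RequireAll>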




I have an easier solution for you: just shut down your site and be done with it. That way no malicious actor will be able to save your precious information.


Why not just block the ia_archiver user agent in your web server for those paths instead? Also, I'm curious: what could that good reason be?
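
For the blocking part, something along these lines should work, assuming Apache with mod_setenvif and 2.2-style access directives (the directory path is just a placeholder):

    # Sketch: keep the ia_archiver user agent out of one section
    BrowserMatchNoCase "ia_archiver" archive_bot
    <Directory "/var/www/site/private">
        Order Allow,Deny
        Allow from all
        Deny from env=archive_bot
    </Directory>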


Can I ask what the good reason is?


Are you under the impression that individual web archivists don't also scrape websites of interest and submit those WARCs for inclusion in the Wayback Machine, independently of the IA's crawlers?

Because believe me, we do... Good luck banning every AWS and DO IP range.


Thank you for the tip. I wasn't aware of that, but it was not a problem to update the rules to account for the full AWS range based on the new information. I greatly appreciate your feedback. I am not sure what DO is, though; would you be so kind as to deacronymize that for me? Thank you.


We also run crawlers on our home laptops, on university servers, on every cheapo hosting service we can find (especially if they offer decent or "unlimited" bandwidth), and so on. Tools like wget and wpull can randomize the timing between requests, use regex to avoid pitfalls, change the user-agent string, work in tandem with phantomjs and/or youtube-dl to grab embedded video content...
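
To give a rough idea, a single wget invocation can already cover most of that; the URL and the reject pattern below are placeholders, and wpull accepts largely compatible options:

    # Sketch: mirror a site with randomized delays and a browser-like user agent
    wget --mirror --page-requisites --convert-links \
         --wait=2 --random-wait \
         --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
         --reject-regex="/(logout|delete)/" \
         https://example.com/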

Good luck playing whack-a-mole against the crawlers. I admit to being very curious about what you're openly hosting online that you really don't want saved for posterity.


DigitalOcean.


Banning AWS and DO is pretty simple for those who care. If your audience is people rather than automation, you won't get many false positives, though there are some real people browsing through proxies hosted on AWS/DO.
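
AWS at least publishes its ranges in machine-readable form, so the block list can be scripted; a rough sketch, assuming curl and jq are available (DigitalOcean's ranges you would have to source separately):

    # Sketch: turn Amazon's published IP ranges into Apache Deny directives
    # (IPv4 only; ipv6_prefixes is a separate key in the same file)
    curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
      | jq -r '.prefixes[].ip_prefix' \
      | sed 's/^/Deny from /' > aws-deny.conf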


I actually didn't know that. Do you operate the same crawlers?

I have considered putting up a single file that is only reachable via nofollow links and perma-banning any IP that accesses the file, as a way to punish bad robots.
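
If I ever do it, I would probably wire it up with fail2ban rather than hand-rolling it. A rough sketch, assuming Apache's combined access log and a hypothetical /trap.html honeypot path:

    # /etc/fail2ban/filter.d/honeypot.conf (sketch)
    [Definition]
    failregex = ^<HOST> .* "GET /trap\.html

    # jail.local entry: ban on the first hit
    # (a negative bantime makes the ban permanent in recent fail2ban versions)
    [honeypot]
    enabled  = true
    port     = http,https
    filter   = honeypot
    logpath  = /var/log/apache2/access.log
    maxretry = 1
    bantime  = -1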


FWIW, humans happen to be able to choose their user agent at will.

Not so long ago, changing your user agent to that of a search engine bot was a simple workaround for some paywalls on pages that appeared in search results.
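
As a concrete example, something like this (the URL is a placeholder; the string is the user agent Google documents for Googlebot):

    curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
         https://example.com/some-paywalled-article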

It's also one of the techniques used for extra privacy and for messing with fingerprinting. For example, Random Agent Spoofer: https://github.com/dillbyrne/random-agent-spoofer



