I totally ignore it and my bot has never been caught. If they catch me I'll just say the script wasn't working correctly. But what you're saying is wrong: there is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem; I simply choose not to follow your rule. I have that choice, and you have the choice to block my IP, which I think is more harmful anyway.
> there is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem
You're not wrong about robots.txt; you're wrong in a much broader way. There is in fact an extremely dangerous law that could easily ensnare what you're describing: the CFAA (Computer Fraud and Abuse Act).
I don't know whether the CFAA applies in my country; I do know that we don't need to comply with the DMCA.
I don't think that browsing a web page and saving its content is the same as scamming people with fake online shops. That kind of fraud is growing in my country, and the local police have no jurisdiction over it.
If it's a global problem we need global rules. We can't have China ignoring authors' rights on one hand while blaming only local people on the other; that's stupid.
Especially when it's non-tech people who make the rules: they don't understand the technology, so they shouldn't have a say in it.
EDIT: You can be mad at me and downvote, but what I'm saying is true and relevant. The US isn't the whole world, especially when there are better ways to protect your site than hiding it behind a robots.txt.
We do have global rules: the Berne Convention, which has been ratified by over 170 countries (nearly every UN member, plus the Holy See and Niue), establishes that copyright is automatic and largely universal, so any unauthorized copying is infringement. By listing certain paths in robots.txt, a site is explicitly saying it doesn't authorize people to crawl them, so unless you have a license granting you permission, your legal position is probably iffy - CFAA or not.
Obviously some countries enforce this more laxly than others, but don't be surprised if the US starts squeezing and one day you suddenly get a knock on the door.
I agree that humans will do what humans can do and bots will do what bots can do. The law is murky and I don't wish to donate to lawyers. I believe that, where possible, engineering solutions are the answer.
Using simple conditional rules in haproxy, I stop most bots from crawling anything beyond my root page, robots.txt, and humans.txt. Anything else gets silently dropped; the bots retry for a while and then go away. I no longer see anything in the logs beyond the root page and robots/humans.txt. A sketch of the idea is below.
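Roughly, a minimal sketch of that kind of haproxy setup. The ACL names and the User-Agent patterns here are hypothetical stand-ins (the comment doesn't say how bots are identified); haproxy's silent-drop action is what produces the "drop without responding" behavior:

```
frontend www
    bind *:80

    # Paths anyone, including bots, may fetch (exact path matches)
    acl allowed_path path / /robots.txt /humans.txt

    # Crude bot detection via User-Agent substrings; real rules would
    # be tuned to the traffic actually seen in the logs
    acl is_bot hdr_sub(User-Agent) -i bot crawler spider

    # Drop the connection without sending any response at all;
    # the client hangs, retries for a while, then gives up
    http-request silent-drop if is_bot !allowed_path

    default_backend app

backend app
    server web1 127.0.0.1:8080
```

Compared with returning a 403, silent-drop gives the bot nothing to adapt to; it just sees timeouts.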
Hey everyone, look & archive! This is where Jerome Renoux of Akamai announces that he doesn't believe in any morality beyond what's codified in law, and that he'll lie in court if you try to get him to behave decently.
Also thanks for spreading bad information.