Robots.txt meant for search engines don’t work well for web archives (archive.org)
295 points by r721 on April 21, 2017 | 143 comments



It appears that IA applies (or did apply) a new version of robots.txt to pages already in their index, even if they were archived years ago. That's silly, and not doing that would probably solve much of this problem.


I first came across that issue during one of the many Facebook privacy scandals. I'd found some juicy bits in (IIRC) a much earlier version of their privacy policy. But when I went back to it later, the robots.txt had been updated, and the earlier archives had been obliterated.

That just seems wrong.


Next time resave the page on some other service like archive.fo.


Is the only difference between archive.is and archive.fo https?


I believe archive.is is also on https.


I don't think it's "silly". IA operates in a sketchy legal environment. There's no fair use exclusion for what they're doing, and it made sense to be extra careful and deferential towards website operators, lest they get hit by a lawsuit.


> There's no fair use exclusion for what they're doing

Fair use is not the only exception to copyright. US copyright law has a separate section on exceptions for libraries and archives.


There are [1] but they seem to be pretty rooted in making a copy of original works in a physical library. The parent's point that the IA operates in a very grey area of law and therefore needs to bend over backwards to comply with requests to remove material still applies.

[1] https://www.law.cornell.edu/uscode/text/17/108


Ask yourself what you would rather have the IA spend its meager funds on: buying hardware and paying people to do critical work, or paying a bunch of lawyers to fight lawsuits against much better funded opponents that they would lose anyway.


I asked, and the answer was: "it's important to fight these fights, which is why I'm donating to the ACLU".

I believe US Code 108 is relevant here. It starts:

    it is not an infringement of copyright for a library 
    or archives, or any of its employees acting within the
    scope of their employment, to reproduce no more than 
    one copy or phonorecord of a work[...]
There's obviously more to it that I haven't done research on, but that's a pretty good start and I wouldn't worry too much about lawsuits. In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.


If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

Nice point about the lack of implied permission in copyright. It makes me think robots.txt probably doesn't have any meaning one way or the other legally, but is just a community thing.


> If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

It's more than a theoretical point - that each "serving" of a file is a copy is well established legally. In fact, even loading a program to RAM was considered a copy, per MAI Systems Corp. v. Peak Computer, until Congress made an explicit exception.


And that exception only applies for people doing maintenance on your computer.


>If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

It absolutely could be/would be argued. Otherwise an arbitrary library or archive--oh, let's give it a name like Google Books--would have the right to make digital copies of physical books available to the public. Obviously Google tried to do this and (although the case was/is complicated) they weren't allowed to do this unconditionally.

ADDED: Or, heck, any site could declare themselves an archive and offer up ripped CDs to the public.


> I asked, and the answer was: "it's important to fight these fights, which is why I'm donating to the ACLU".

The ACLU and IA are two different entities; donating to one does nothing to help the other.

> I believe US Code 108 is relevant here.

Yes, it is.

> There's obviously more to it that I haven't done research on

Glad we got that out of the way.

> but that's a pretty good start and I wouldn't worry too much about lawsuits.

Well, since you're not operating the archive it isn't you that should be worried. And given that 'there is more to it that you haven't done research on', it is probably fair to say that such lack of worry is a bit premature.

> In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.

Because it shows effort on their side to not collect when copyright holders make a minimum effort to warn outside parties not to collect their data.

In the eyes of a judge - or a half decent lawyer - that will go a long way towards establishing that the archive made an effort to stay on the bright side of the line.

Law is interpreted, the fact that there is no such provision in copyright law doesn't mean that a judge isn't able to look past the letter and establish intent. If you are clearly in violation and refuse to do even the minimum in order to avoid such violations then judges tend to be pretty strict, in other words, they'll throw the book at you. But if you can demonstrate that you did what you could and that the plaintiff did not make even a minimum effort to warn others that archival storage or crawling is not desired then their case suddenly is a lot weaker.

See also: DMCA and various lawsuits in lots of different locations, the internet is far larger than just the USA and there are a number of interesting cases around this subject in other countries, some of those had outcomes that were quite surprising (at least, to non-lawyers).

I copied Geocities.com when it went down and have had quite a bit of discussion with IP lawyers on the subject. So far I've been able to avoid being sued by responding promptly to requests from rights holders. But that doesn't mean they would not have standing to sue me, and if they did I might even lose.

This is not at all a settled area of the law and if you feel that the Internet Archive is in the right here no matter what then you could of course offer to indemnify them from any damage claims.


No help there, the exception is very limited in the number of copies it may produce, among other factors: https://www.law.cornell.edu/uscode/text/17/108


A "copy", in this context, is a file. They can have three of those, which aligns perfectly with standard backup practices. Serving them is distribution, but not copying.


Yes, it is. Even loading a file to RAM was considered a copy (see MAI Systems Corp. v. Peak Computer, Inc) until Congress made an explicit exception.


What do you mean by sketchy legal environment?

Couldn't they move operations to a non-sketchy one? IIRC they anticipated the need for such a move due to Trump and now have a backup ready in a different country.


I'm not saying the US is sketchy, I'm saying what they do is legally sketchy, considering copyright (which exists in the whole world). Though it's possible that some countries have archival exceptions that would cover them, I don't know.


How does this relate to the robots.txt file? Even being deferential it doesn't make much sense to respect it.


Upvoted. That's exactly the problem, and the way to solve it.

Example: two months before the movie "The Social Network" was released to theaters in 2010, Facebook decided to add a robots.txt to Facebook.com. Immediately Archive.org deleted/disabled access to the archived copies of how the Facebook start page looked from 2004-2010.

BTW, the correct way would be to reactivate archive access to Facebook.com for the 2004-2010 time-frame. The book "The Accidental Billionaires: The Founding of Facebook" and the film "The Social Network" based on that book of course partly used Archive.org, among various other research methods, to get the facts.


What about extending robots.txt to include date ranges?

For future domain-owners this is likely far too much control, but maybe that could be mitigated if IA tracks DNS/whois/registration info too
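Purely as an illustration, a date-scoped rule might look something like this (hypothetical, unsupported syntax; no crawler or standard actually understands directives like these today):

    # hypothetical directives -- illustration only
    User-agent: ia_archiver
    Disallow: /
    Applies-from: 2017-04-21
    Applies-until: 2019-01-01

The idea being that the exclusion would only apply to captures within that window, rather than to everything ever crawled from the domain.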


If we did this, the date range would be set to "Forever" 99.99% of the time.


yeah, it's not even a hard problem to solve: just use the archived version of robots.txt that matches the crawl date

too bad they already lost loads of internet content that way
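A rough sketch of that playback-time logic (illustrative Python only, not how the Wayback Machine is actually implemented; snapshot storage is hand-waved):

    import bisect
    from urllib.robotparser import RobotFileParser

    def robots_in_force(snapshots, crawl_date):
        # snapshots: list of (date, robots_txt_text) captures, sorted by date
        dates = [d for d, _ in snapshots]
        i = bisect.bisect_right(dates, crawl_date) - 1
        return snapshots[i][1] if i >= 0 else ""  # nothing archived yet: allow all

    def allowed_at(snapshots, crawl_date, user_agent, url):
        # apply only the robots.txt that was live at capture time
        rp = RobotFileParser()
        rp.parse(robots_in_force(snapshots, crawl_date).splitlines())
        return rp.can_fetch(user_agent, url)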


They didn't lose anything. Content excluded in this manner is only made inaccessible to the public, not deleted from the archive. They can change their policy retroactively.


It may be possible technically. I doubt it's possible in reality: they've probably made promises to a bunch of people over the years that this is how it works. People get furious that Their Content is appearing somewhere else and go straight to the lawyers on the first email.


> It may be possible technically. I doubt it's possible in reality: they've probably made promises to a bunch of people over the years that this is how it works. People get furious that Their Content is appearing somewhere else and go straight to the lawyers on the first email.

They won't be furious when they're dead.

I think the main value of the Internet Archive is not so much in the near term, but in the long term. I hope in the future they enact some policy that ignores any robots.txt for scrapes older than, say, 50 years.


What should happen in the case that a website misconfigures robots.txt and ends up wanting to remove private data?

I think I would be tempted to say that the data can't be removed to avoid abuse from future domain owners (or current ones) but I'm not sure if there would be any legal consequences of this attitude.


Provide a content removal form? It works for DMCA notices, it can work here. Maybe even have a 'reason' textbox to see why someone may want content removed...


Of course, there's the inevitable risk that the Internet Archive's newfound control over who is allowed to make their past disappear into the memory hole and who has it archived forever will be used for political ends, especially since the ability to manually archive pages is already used this way by staff. (Take a look at Jason Scott's Twitter or that of the Archive Team sometime - lots of conspicuous manual archiving of stuff that's embarrassing to a certain US political party.)

The issue of curators' views biasing the contents of collections seems to be underappreciated in general in the digital age, for some reason.


Just to idly correct you.

Archive Team (not a part of Internet Archive) actually archives piles and piles of web-based material, sometimes in response to current events, sometimes because of known shutting down of services, and sometimes because of speculative worry about longevity. (For an example of the last one, we've been archiving all current FTP sites left.)

Meanwhile, Internet Archive's crawlers are bringing in millions (really millions) of URLs every day, just constantly grabbing websites, files, video, you name it.

There's certainly a "bias" to the current administration in terms of 1. They're in power 2. They keep removing things new and old. But think of it as us having a few lights shined in specific directions while thousands of other floodlights go literally everywhere.


Here I'd lean towards archiving everything indiscriminately. Politicians especially should not have the "right to be forgotten", because what they do is of historical interest.


To clarify the nature of the distortion you are referring to, it would be a sampling bias.

In general, the archive spiders the web and ingests information so that there is a certain mean frequency of visits and a certain likelihood of any particular revision of a web page being captured.

There would be instances in which data was entered into the archive more certainly and more frequently, on the basis of the nature of that data, than otherwise would have occurred.

What one means by bias when one says that this biases the contents of the collection needs to be understood with some care. It would be interesting to hear some historians discuss the matter. I do not think that it is a type of bias that is likely to lead them very far astray.

If it mollifies your concerns any, the last time I checked, anyone could manually archive any web page they liked. However, I would recommend writing to The Archive to express your concern.

I have an entirely partisan appreciation of the ability of The Archive to prevent redactions from the historical record of material that might later be disavowed. However, I share your more general view that there is no reason that the online history of any single major U.S. political party should be documented any less carefully than any other.


Same when an archive crawls illegal/copyrighted data; a pathway for that needs to exist anyway.


Another solution might be including http://archive.org in their archives.


It's turtles all the way down!


On the linked page, I see comments about ignoring the webmasters' wishes et al.

All I can say is f*ck that. It's a free and open internet. If you put content up on a public site, anyone has the right to go and look at it. Stop complaining when someone saves it.

And sure some people complain that scrapers slow down their site and that's why they use robots.txt, but really? Really? It's 2017 and your site is affected by that. I think you have bigger things to worry about.


> Really? It's 2017 and your site is affected by that.

That someone wants to use a robot to completely scrape an entire dynamic website is their goal. A site is not responsible for making that possible. One bot causes _way_ more traffic and CPU usage than a normal visitor, or even 1000s of visitors.

Saying '2017' or anything else: meh.

Various network operators are pretty helpful. Sending abuse complaints regarding misbehaving bots has resulted in actions before. I've seen action being taken from universities, ISPs, etc. Though normally the bots are auto-blocked (on IP address or ranges; quite easy to script).

robots.txt is an established / de facto standard. Ignore it, and be prepared to explain why. IMO pretty much any country has computer hacking laws which are vague enough that consciously ignoring such a standard can be seen as "invading".

A "not my problem" approach: I think you should really think a little bit more.


I totally ignore it and my bot never gets caught. If they catch me I will say that the script wasn't working correctly. But what you are saying is wrong: there is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem; I just don't follow your rule. I have the choice not to, and you have the choice to block my IP, which I think is more harmful.

Also thanks for spreading bad information.


> there is NO LAW stating that /robots.txt must be obeyed. Therefore it's not my problem

You're not wrong about robots.txt, you're wrong in a much more broad way. There is in fact an extremely dangerous law that could easily ensnare what you're talking about:

https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act


I don't know if the CFAA applies to my country; I know moreover that we don't need to comply with the DMCA.

I don't think that browsing a web page and saving its content is the same as scamming people with fake online sites. That is growing in our country and the local police don't have any power.

If it's a global problem we need global rules; we can't have the Chinese not respecting authors' rights while on the other hand only blaming local people. It's stupid.

Especially when it's non-tech people who make the rules; they don't know tech and therefore should not say anything about it.

EDIT: You can be mad at me and downvote, but what I say is true and relevant. The US is not the only country in the world, especially when there are other ways to protect your site than a robots.txt.


We do have global rules: the Berne convention, which has been ratified by over 170 countries including the Holy See and Niue, states that copyright is automatic and mostly universal, so any unauthorized copying is illegal. By having certain paths listed in robots.txt, the site is explicitly saying it doesn't authorize people to crawl them, so unless you have a license granting you permission, your legal position is probably iffy - CFAA or not.

Obviously some countries have a more lax enforcement than others, but don't be surprised if the US starts squeezing and one day you suddenly get a knock on the door.


I agree that humans will do what humans can do and bots will do what bots can do. Laws are murky and I don't wish to donate to lawyers. I believe engineering solutions, when possible, are the answer.

Using simple conditional tests in haproxy, I stop most of the bots from crawling anything more than my root page, robots.txt and humans.txt. Anything else gets silently dropped and the bots will retry for a while then go away. I don't see anything in the logs beyond the root page and robots/humans.txt any more.
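The gist is roughly this (a minimal sketch assuming HAProxy 1.6+ for silent-drop; the ACL names, User-Agent patterns, and backend are illustrative, not my exact rules):

    frontend www
        bind :80
        acl looks_like_bot hdr_sub(User-Agent) -i bot crawler spider
        acl allowed_path   path / /robots.txt /humans.txt
        # silently drop bot requests for anything except the whitelisted paths
        http-request silent-drop if looks_like_bot !allowed_path
        default_backend app   # normal traffic goes to the app backend defined elsewhere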


Hey everyone, look & archive! This is where Jerome Renoux of Akamai announces that he doesn't believe in any morality beyond that codified in law, and how he will lie in court if you try to get him to behave decently.


they only ever needed to honor the robots.txt at the date of archival.

archive.org fucked up by making robots.txt retroactive. If they used the archived robots.txt as a filter for a site at the relevant date, they'd have had the best of both worlds - respecting how sites work without losing how sites appeared at a given date.


They'd then be flouting copyright laws, like Google does, but nonetheless tortiously. They're making a copy, which is already an infringement; distributing it seemingly against the owner's express wishes is treated as a crime in some jurisdictions.


Robots.txt is explicit permission to make a copy, otherwise crawling is meaningless, and that is not necessarily reversible. Like putting up a yard sale sign, then trying to get the people who showed up yesterday arrested for trespassing.

What the archive can do after that point is a different issue, but they clearly can keep a copy. Further, if someone else is now using the domain, they don't necessarily have anything to do with the archived data.


Robots.txt is usually an explicit instruction for a robot not to crawl. But a robot crawling your site and an archive, cache, or duplicate page are all different propositions.

Google and others have enhanced robots.txt to enable permission for crawling (allow, sitemap), meta tags can deny archiving and various means allow permission to be explicitly denied for caching.

To use your analogy of raising a sign: if you don't put up a 'no trespassing' sign then it doesn't make trespassing legal.

FWIW I disapprove of this state of affairs and consider copyright to be hugely defective in these respects.

>but they clearly can keep a copy //

It's nuanced but permission to access a page =/= permission to keep a copy. Just as you have explicit permission to access a video on YouTube but in most jurisdictions will not have permission to download it for later (commercial) use.


>To use your analogy of raising a sign: if you don't put up a 'no trespassing' sign then it doesn't make trespassing legal.

Right. And it's actually not a bad analogy as analogies go. Not having a sign doesn't make trespassing legal but if someone sometimes walks over a corner of your property and you go to the police to try to get him arrested, the first thing they'll probably ask you is if you have your property posted and/or if you bothered to ask him to stop. If the answer is no, they'll probably tell you to go away and do so and only come back if he ignores the sign.


Yeah, you come down to breakfast and there's a random person sat on your sofa, you ring the police and they say "meh, you didn't put a sign up"?

The requirement to post a sign to make trespassing an actionable offence is a USA thing AIUI, it's not a UK thing at least, but copyright is almost universal and doesn't require even adding a (c) mark, it's automatic at the point of creation under the Berne Convention. Or in other words you've pushed your analogy too far and hit a marked legal difference between USA physical property law enforcement and international intellectual property law.


Clearly entering a house is much different from the specific example I gave of walking across a corner of a property. Also, a number of European countries have various variants of right to roam that make trespassing a less actionable offense than in the US.

The more general principle is that if no harm is done and the individual/organization will just stop the action if you asked, the courts are often reluctant to get involved. There are exceptions of course, especially in the vein of making an example of someone to discourage others.


Not having robots.txt is having no sign. So if you have the file you are putting up a sign that says something.

Also, it's meaningless for a bot to get permission without having permission to make a copy. There are arguments around the number of copies, but the clear implication is that at least all the routers can make a copy.


>So if you have the file you are putting up a sign that says something.

Not really. There are often default robots.txt files that the system just puts there in the course of building a default website.

The legally "right" way to do things is only archiving sites that give explicit permission to do so. But then, for all intents and purposes, you can't have a web archive. So we have the current ask forgiveness rather than permission system which works fine most of the time for organizations like the IA and AT. But it does mean that someone like the IA is inclined to err on the side of removing content if someone objects.


That might hold water if the robots.txt were the default on every site ever used by a company, but change even a single bit on any of them and what's left is clearly your intent on all of them.

Further, setting up a physical device connected to a public IP is never default behavior, so you are putting up the sign in either case. So, at best your argument is that someone authorised to do something put up a sign by mistake saying something that was not intended, but your intent has little relevance at that point.

Worse, your argument is based on the assumption that nobody knew what was going on so even simple coursework mentioning robots.txt would demonstrate knowledge and thus intent through willful inaction.


IA has every right to scrape, save, and display. See US Code 108. Robots.txt is mostly meaningless in terms of the law, because it doesn't say anything about copying or distributing–only access, which is outside the scope of copyright.


I think you mean 17 USC 108. It might surprise you to know that the web extends outside the USA and that Fair Use and exemptions for libraries and archives (aside: where's the definition of those two terms in the USC that's being used here?) don't extend to the world. IA may be legal in the USA, but a whole lot of the websites of the world reside on servers in other countries.

There's a minor technical problem with that USC too, it seems. It allows archives to "reproduce no more than one copy of a work". But to compare a website you make a second [admittedly transient] copy to decide whether to re-archive. That's technically not within the scope of that 17USC108 accommodation AFAICT. This may have been solved in US law; I've a feeling there was a modification of EU law to allow transient/cache copies?


Does anyone know why this was downvoted?


The law has specific exemptions for archival and search, which google and IA use. Do you really think they could do what they're doing when any one of millions of people could just take them to court at any time?


Well, they have been sued, e.g. https://arstechnica.com/uncategorized/2006/08/7634/ and they settled.

But they're not a target that you're going to collect big from, as a non-profit archive/library they're sympathetic whether or not that gives them any special legal standing, and they'll basically take down your content past and present if you ask them to.

So it will almost certainly cost you money to sue them, you won't collect much in the best case, and you can get your content taken down in about as much time as it would take you to pick a lawyer out of the phone book.


So my website is on a server outside the USA, what exemption are they relying on?


What's your opinion on Google scraping your content and putting up minimally attributed snippets?


"minimally attributed"?

On Google? Really?


See this article for examples of how snippets don't always attribute correctly:

https://theoutline.com/post/1399/how-google-ate-celebritynet...

The author runs CelebrityNetWorth.com, which BusinessInsider cites, but the snippets cite BusinessInsider. So the user doesn't see the proper attribution.


> scrapers slow down their site and that's why they use robots.txt

A poorly written scraper may really slow down your site, especially if it wasn't intended to be scraped repeatedly. There should be some way to specify the frequency scrapers should follow (set by the website owner via a robots.txt-like spec).

But website owners cannot demand unreasonable frequencies (such as once a year!), and what constitutes unreasonable is up for debate.


I don't think a poorly written scraper would follow robots.txt rules according to spec. So, in any case the site should have other measures (rate limiting?) anyway.


Additionally, if excessive scraping became an issue for my site I'd consider rate limiting clients.


> (specified by the website owner via a robots.txt like spec).

Nope, if a website wants such a restriction, it must enforce it. Robots.txt is a request. It's worthless.


If a robot misbehaves, it'll either be blocked or it'll go to the network's abuse desk and that bot will be taken down. That a site could possibly have some kind of technical solution to this doesn't matter.


Precisely - the solution here needs to be that the server blocks the robot - if it can differentiate it from other traffic that is. That's all well and good and that's the solution which should be used here. If you don't want to be archived, block the IP.


The Crawl-delay directive is the de facto standard for this.
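For example (honored by Bing and Yandex among others, though notably not by Googlebot; the value is in seconds):

    User-agent: *
    Crawl-delay: 10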


> If you put content up on a public site, anyone has the right to go and look at it.

Fine.

> Stop complaining when someone saves it.

Fine.

What you don't say is that it is fine to recreate and publish that content against the owner's wishes, especially when said content is copyrighted in one way or another. You're failing to see the whole picture from the content owner's point of view.


Perhaps implicit in the comment you replied to is the idea that there's no such thing as a "content owner".


That's the debate I was trying to initiate!

For instance, is it OK to crawl a blog with explicit copyright, save that data, then publish it elsewhere?


Consider this very website, Hacker News. Every single comment has its own URL, which creates a huge number of redundant URLs. There's no reason for archive sites to be scraping a separate copy of every single comment after they've indexed the thread. And it makes it harder to use Google to search HN.


Aw man, you should have been there 10 years ago, this website Hacker News had a URL to access every single comment. And quite often some of the comments were more informative than the linked post; I've bookmarked quite a few of those myself. But with Hacker News gone they're gone too, because archive.org failed: though they do have the data, they broke the links and it is now inaccessible to me.


I see your point. Perhaps a better solution would be to get rid of permalinks to individual comments entirely and point to them with fragment identifiers.


They also make the content available to the public, which is directly competing with the site owners.

That's inviting lawsuits they can't win, and expecting people to pay the bandwidth for it too.


The policy the Internet Archive applies re: "robots.txt" comes from an archive policy created at U.C. Berkeley in the early 2000's (The Oakland Archive Policy - http://www2.sims.berkeley.edu/research/conferences/aps/remov...).

Jason Scott (an employee of the Internet Archive) mentioned that the Archive doesn't ever delete anything. He stated that items may be removed from public access because of changes to "robots.txt" but they're not actually deleted. (That's a little comforting, at least.)


Archive.org needs to be able to apply to itself. We could then use archive.org to view how archive.org in the past viewed some interesting site, thus avoiding the whole retroactive robots.txt fail.

;).


Fantastic news.

Archive Team's take on this[0]

[0]http://www.archiveteam.org/index.php?title=Robots.txt


It is great news in general, but seems to be done in a clumsy and counterproductive manner that may cause the Internet Archive to be banned from crawling some websites.

The problem: when robots.txt for a website is found to have been made more restrictive, the IA retrospectively applies its new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep-archive. No-one outside IA thinks this is sensible.

Their solution: ignore robots.txt altogether. What? That will just annoy many website operators.

My proposed solution: keep parsing robots.txt on each crawl and obey it progressively, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt says to ignore about_iphone.html you just do that and ignore it. Older versions aren't affected.

Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators.
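A minimal sketch of the proposed progressive approach at crawl time (illustrative Python using the stdlib robotparser; archive() is a hypothetical helper, and this is not the IA's actual crawler):

    from urllib.robotparser import RobotFileParser

    def crawl_site(site, urls, user_agent="ia_archiver"):
        rp = RobotFileParser(site + "/robots.txt")
        rp.read()  # fetch today's rules; they govern only today's crawl
        for url in urls:
            if rp.can_fetch(user_agent, url):
                archive(url)  # hypothetical helper: fetch and store a new snapshot
            # disallowed URLs are simply skipped; snapshots already in the
            # archive are left untouched either way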


Archive Team is not associated with Internet Archive. AT does not crawl the web at large, it only targets specific sites.


There's some value in allowing site operators to retroactively remove content which was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subsequently being indexed and archived when a crawler finds the appropriate directory index.

What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.


Here's a slight modification to the GP proposal:

- Respect robots.txt at the time you crawl it.

- If robots.txt appears later, stop archiving from that date forwards.

- Preserve access to old archived copies of the site by default.

- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic today retrospectively. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.


>mechanism that allows a proven site owner to explicitly request retrospective access removal. //

It should be "a proven content owner", just buying a site shouldn't allow someone to remove it from archive.


How about you respect the robots.txt until the IP address where it is hosted changes. Once the IP has changed, then any new robots.txt exclusions apply only to the new pages not the archived pages under the old IP, which continue respecting the old archived robots.txt.

The IP address changing is a pretty solid indicator that control of that content has moved to a new organisation. Note this does not always coincide with the domain name owner changing.

A scenario that I can imagine becoming litigious: company owns a domain for promoting some product and they use robots.txt to prevent copies. The product reaches end of life and domain is allowed to expire. Someone else buys the domain and starts hosting content with no robots restriction. Archive.org start to display pages from the old company. Company then sues archive.org for copyright violation.


>may cause the Internet Archive to be banned from crawling some websites.

It looks like Facebook banned ia_archiver (recently? I recall it worked a few weeks ago):

>User-agent: ia_archiver

>Disallow: /

https://www.facebook.com/robots.txt


The logic is sound, and I see that it was mostly written in 2011, but I can also see it being harmful.

How about an IETF RFC to clarify?

Libraries operate under a lot of unwritten social conventions, perhaps even more than most other institutions. (robots.txt even if largely ignored is a popular convention) Aggressive or confrontational wording, regardless of whether they are "right" doesn't seem in libraries' interests.


In the 90s I spent a lot of time on my website and I loved learning how web crawlers worked. I started using a robots.txt file without really understanding it. I ended up blocking everything, thinking it would make my site faster for visitors because crawlers might crawl the site all the time.

After I graduated from college I lost access to my website which was hosted on the Computer Science department's web servers.

I wish I hadn't used that robots.txt file. I would love to find the pages I made that compared interfold vs. exterfold staple strength, or the site I made with a ranch theme with a cowboy that had humorous advice....I don't have any content in archive.org because it honored the robots.txt file.

...sigh...wish I had backed up my stuff.


To be honest I see robots.txt as a failed experiment since it relies on trust rather than security or thoughtful design.


I don't think it's about security.

For example I've got a link to do delegated login like /login-with/github. When people click it an oauth flow will start. But it is useless for robots to follow so I disallow it in robots.txt. If they still follow nothing breaks and it's not a security issue but if I can avoid starting unnecessary oauth logins it's an additional benefit.
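i.e. something along these lines (Disallow is prefix-matched, so this would cover /login-with/github and any other providers under that path):

    User-agent: *
    Disallow: /login-with/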


robots.txt wasn't created for security, but it can have security implications if you publish a list of Disallow paths with the intention of hiding sensitive content (sadly I have seen that happen a lot), whereas a better approach would be IP whitelisting and/or user authentication.

However I'm not claiming security is the only reason people use (misuse?) robots.txt. For example, in your case you could mitigate your need for a robots.txt with a nofollow attribute[1] (see the sketch after the reference below). Sure, bad bots could still crawl your site and find the authentication URL without probing robots.txt, so the security implications there are pretty much non-existent. But you've already got a thoughtful design (the other point I raised) that mitigates the need for robots.txt anyway, so adding something like "nofollow" may be enough to remove the robots.txt requirement altogether.

[1] https://en.wikipedia.org/wiki/Nofollow
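For the OAuth login link mentioned above, that would be something like the following (the link text is just illustrative):

    <a href="/login-with/github" rel="nofollow">Log in with GitHub</a>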


This is crazy, that's not what robots.txt is for. How can you complain about the security of a thing that is not meant to provide security?

According to your logic, newspapers are a "failed experiment because they rely on trust rather than security or thoughtful design". I published an article with my treasure map and told people not to go there, but they stole it.


That was an anecdote since the previous poster raised the point about security. I'm definitely not claiming robots.txt should be for security nor was designed for security!

I said following proper security and design practices renders obsolete all the edge cases that people might use robots.txt for. I'm saying if you design your site properly then you shouldn't really need a robots.txt. That applies to all the examples that HN commenters have raised in terms of their robots.txt usage thus far.

I would rewrite my OP to make my point clearer but sadly I no longer have the option to edit it.


design your site properly then you shouldn't really need a robots.txt

But how? For example, if you don't want a page to be indexed by Google, you add this information to robots.txt. Nofollow doesn't work for every case, because any external website can link to it, and Google will discover it.


That's a good point. I'm not sure how you'd get around non-HTML documents (eg PDFs) but web pages themselves can be excluded via a meta tag:

    <meta name="robots" content="noindex">
Source: https://support.google.com/webmasters/answer/93710?hl=en

Interestingly in that article, there is the following disclaimer about not using robots.txt for your example:

"Important! For the noindex meta tag to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex tag, and the page can still appear in search results, for example if other pages link to it."

I must admit even I hadn't realised that could happen, and I was critical of the use of robots.txt to begin with.


For PDF you can use X-Robots-Tag HTTP header [0].

Nofollow is a good suggestion if you control the links to the resource, robots.txt if you don't.

[0]: https://developers.google.com/webmasters/control-crawl-index...
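For example, as a response header on PDF files; in Apache that could be done roughly like this (assuming mod_headers is enabled, similar to the examples in the Google documentation linked above):

    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex, nofollow"
    </FilesMatch>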


Ah, that's true, indeed. The page, though, will appear as a link without any contents, because the bot won't be able to index it.


Except it has indexed it. It just hasn't crawled it. But content or not, the aim you were trying to achieve (namely your content not being indexed) has failed. Thus you are then once again dependent on other countermeasures that render the robots.txt irrelevant.


> robots.txt wasn't created for security but it can have security implications if you publish a list of Disallow paths with the intention of hiding sensitive content

Using robots.txt to secure your server from bots is the equivalent of attempting to secure your house from robbery by planting a sign that says "please, don't rob my house". Surprisingly it may work from time to time, but if you're into attempting security by wishful thinking, don't be too surprised when it fails about as often as security by chance.


I know. With the greatest of respect your counter argument is literally just reiterating the point I was making. Albeit in the quote you've left off the part of my post where I said it's stupid to use robots.txt in this way.


Note that links marked with Nofollow can still be followed by well-behaving bots: https://en.wikipedia.org/wiki/Nofollow#Interpretation_by_the...


robots.txt is not a security tool. It's a communication tool that gives advice. Just like sitemaps.

If you add security (logins) to protect content that doesn't need protecting, you inconvenience users.


I'd already covered the security point replying to another poster (https://news.ycombinator.com/item?id=14163792) but just to be clear, I'm absolutely not claiming robots.txt is a security tool. Quite the opposite: I'm saying following good security and design practices renders the robots.txt file obsolete.

Your point about sitemaps helps illustrate that point of mine, because having a decent sitemap mitigates the need for Allow lines in robots.txt. It's another area of the web that robots.txt isn't well equipped to handle, and thus there have been other, better tools built to highlight pages of interest to search engines.


https://en.wikipedia.org/wiki/Robots_exclusion_standard#Secu...

robots.txt was proposed after a badly behaved bot DoSed a web server 20+ years ago; those were different times. With the robots.txt standard now, those who want to play nice can do so without asking anything; for the badly behaved ones it's still up to the admin to put the appropriate measures in place.


Wow has it really been more than 20 years!?? I feel old now...

I do get what you're saying but if you have to implement "appropriate measures" anyway then the robots.txt file becomes completely redundant.


I came here to say something about respecting the wishes of others, etc, but you know what? You're absolutely right. We shouldn't even need to have a conversation about trust and respect.

It should be non-negotiable if you don't want your personal contents indexed by scrapers and archivers, and it should be enforced by design. It's a broken system.


Lots of laws are pretty similar. e.g. technically you could steal loads of things. Practically you don't. Defeating/ignoring mechanisms such as robots.txt (vs maybe some security person in a store) still makes stealing not ok.


The morality of whether bots should obey robots.txt is a separate issue to the point I raised about how you shouldn't trust bots to obey them. To use your example of high street stores: shops have security tags on expensive items / clothing as a method of securing products from theft because you cannot blindly trust everyone not to steal (though wouldn't it be great if that wasn't the case). Equally websites cannot trust that bots will obey robots.txt. Which means any content that doesn't want to be crawled needs to be behind nofollow attributes or (if it's sensitive) user authentication layers and any content that does need to be indexed also needs to be in a sitemap. Once you have all of these extra layers implemented, the robots.txt becomes utterly redundant. Hence why I say it's a failed experiment. The benefits it offers are superseded by better solutions.


For whatever it's worth: http://humanstxt.org/


I'm not so sure that even Google respects it. I did some digging into the semantics of robots.txt whilst writing a bot myself, and it seems that Google doesn't follow links that are excluded, but it will visit those pages. Maybe that counts as "paying attention", but I don't think they "respect" it.


they respect it, but because it's so frequently misused or plain broken, it's basically sidelined vs. more optimal methods for preventing indexation, or getting an already-indexed piece of content removed, such as the noindex tag.


I think the biggest argument for honoring robots.txt is that sites, especially old sites, can have a lot of really highly resource intensive pages. I don't want someone crawling a page that has 800+ DB calls... for example. Yes, I should optimize the page, or whatever... but really it may be useful for that admin, or 1 out of 10,000 users, who uses the page. It's not ideal to have someone crawl all those pages at once.

I think they should honor robots.txt, and the meta tag version on specific pages or links -- given the site publisher went out of their way to give instructions to crawlers it seems reasonable to honor those requests.


Found this[1] via Wikipedia's Talk page for the robots.txt article. It shows that early on, robots.txt was intended to help maintain the bandwidth performance of web servers. Back then it would have been due to bandwidth contention; today it may be bandwidth cost to some operators, which robots.txt helps mitigate.

[1] https://yro.slashdot.org/comments.pl?sid=377285&cid=21554125


I remember writing a dumb parser for robots.txt. I have to agree, robots.txt is simplistic but so non-standard. I wonder why search engines can't just say NO to this. Do search engines today still honor robots.txt?

Here's my shameless plug: https://github.com/yeukhon/robots-txt-scanner

I still remember writing most of this on Caltrain one morning heading to SF visiting someone I dearly loved.....


Finally. A bit late, since a lot of the archive has been removed because of new owners' aggressive (or malicious) new robots.txt files.


Hidden, rather than removed.


I hope so.


There should be a way to direct archiving bots to a file that has the newest, compressed version of the website for them to download. Wouldn't that be easier for everyone?


Seems like it would just get abused with false content


True, but maybe it would be a good option for non-commercial websites that would like to get archived, and make the archiving more efficient for themselves and archive.org


I have some sites where I specifically block archiving from some sections for good reason. (Even if I didn't have a good reason though it would still be my choice.)

I have a very big problem with them disregarding robots directives. Sure some crawlers ignore them: Hostile net actors up to no good. This decision means they are a hostile net actor. I'll have to take extreme measures such as determining all the ip address ranges they use and totally blocking access. This inconveniences me, which means they are now my enemy.

edit- For those interested: Deny from 207.241.224.0/22


I have an easier solution for you: just shut down your site and be done with it. This way no malicious actor will be able to save your precious information.


Why not just block ia_archiver useragent in your web server for these paths instead? Also, I'm curious, what that good reason could be?


Can I ask what the good reason is?


Are you under the impression that individual web archivists don't also scrape websites of interest, and submit those WARCs for inclusion into the Wayback Machine, independent of the IA's crawlers?

Because believe me, we do...good luck banning every AWS and DO IP range.


Thank you for the tip. I wasn't aware of that, but it was not a problem to update the rules to account for the full AWS range based on the new information. I greatly appreciate your feedback. I am not sure what DO is though, would you be so kind as to deacronymize that for me, thank you.


We also run crawlers on our home laptops, on university servers, on every cheapo hosting service we can find (especially if they offer decent or "unlimited" bandwidth), and so on. Tools like wget and wpull can randomize the timing between requests, use regex to avoid pitfalls, change the user-agent string, work in tandem with phantomjs and/or youtube-dl to grab embedded video content...

Good luck playing whack-a-mole against the crawlers. I admit to being very curious what you're openly hosting online that you really don't want to get saved for posterity?


DigitalOcean.


Banning AWS and DO is pretty simple for those who care. If you're oriented towards people and not automation, you don't get a lot of false positives, but there are some real people behind proxies in AWS/DO.


I actually didn't know that. Do you operate the same crawlers?

I have considered putting up a single file that is only accessible via no-follow links and perma-banning any IP that accesses the file, as a way to punish bad robots.


FWIW humans happen to be able to choose their user-agent at will.

Not so long ago, changing your user-agent to one of the search engine bots was a simple workaround for some paywalls that appeared in search results.

It's also part of the techniques used to give extra privacy and messing with fingerprinting. For example random agent spoofer: https://github.com/dillbyrne/random-agent-spoofer


I think we should write down a legal license in our robots.txt file, as a retribution for all those lengthy EULAs these big companies make us read :)


Yeah, just ignore robots.txt because there are other solutions.

If a site doesn't want to be scanned it can adopt a lot of countermeasures, and robots.txt will not save it from abuse.

This reminds me of the old days when my website wasn't working from the US, because I just faked that the site was down, since there was no reason for somebody to visit my site from the US (I know it's kind of stupid, but when all your content is in French and you are a kid... :) )


One thing that should be considered is the right for an individual to be forgotten.


In my (current) opinion, it's this law that should be forgotten. What's on the public Internet is a matter of public interest. All I can see is this law being used by bad people to hide their bad deeds, especially when those bad deeds should be known.


At the risk of going off-topic, I loved your usage of the word 'current' before 'opinion', and I'm going to adopt it.


Forgotten by whom? Who judges what is "forgotten" and when? How is it enforced?

The specifics here matter a great deal, the versions so far are regularly abused by the wealthy and don't apply to any of the data warehouses that the powerful and well connected have access to.

Where did this "right" come from? What's the legal and ethical basis for it? It is analogous to censorship or book burning at the basic level, destroying information to hide it from the public. It requires a consistent and strong justification as well as justified limited scope because of that, and it better be obviously beneficial to society even accounting for the inevitable misuse by those in power.


There is a legal basis in the European Union. https://en.wikipedia.org/wiki/Right_to_be_forgotten


and the legal basis is pretty fluffy: - http://www.sueddeutsche.de/digital/bgh-grundsatzurteil-namen... - http://www.focus.de/digital/internet/bgh-urteil-keine-staend...

it's german, but basically it says: "This is not a blank check".


That's not a thing outside the EU.


What about it? Do you want us to burn down the national archives that have a copy of every newspaper and are open to the public? Guess what: the internet gives you access to your own press, allows you to write into other newspapers where possible, and archive.org gets an archive of this public publishing. The right of an individual to be forgotten hardly prevails over human history or the obligation to remember.



