It appears that IA applies (or did apply) a new version of robots.txt to pages already in their index, even if they were archived years ago. That's silly, and stopping that practice would probably solve much of this problem.



I first came across that issue during one of the many Facebook privacy scandals. I'd found some juicy bits in (IIRC) a much earlier version of their privacy policy. But when I went back to it later, the robots.txt had been updated, and the earlier archives had been obliterated.

That just seems wrong.


Next time resave the page on some other service like archive.fo.


Is the only difference between archive.is and archive.fo https?


I believe archive.is is also on https.


I don't think it's "silly". IA operates in a sketchy legal environment. There's no fair use exclusion for what they're doing, and it made sense to be extra careful and deferential towards website operators, lest they get hit by a lawsuit.


> There's no fair use exclusion for what they're doing

Fair use is not the only exception to copyright. US copyright law has a separate section on exceptions for libraries and archives.


There are [1], but they seem to be pretty rooted in making copies of original works in a physical library. The parent's point that the IA operates in a very grey area of law and therefore needs to bend over backwards to comply with requests to remove material still applies.

[1] https://www.law.cornell.edu/uscode/text/17/108


Ask yourself what you would rather have the IA spend its meager funds on: buying hardware and paying people to do critical work, or paying a bunch of lawyers to fight lawsuits against much better funded opponents that they would lose anyway.


I asked, and the answer was: "it's important to fight these fights, which is why I'm donating to the ACLU".

I believe US Code § 108 is relevant here. It starts:

    it is not an infringement of copyright for a library 
    or archives, or any of its employees acting within the
    scope of their employment, to reproduce no more than 
    one copy or phonorecord of a work[...]
There's obviously more to it that I haven't done research on, but that's a pretty good start and I wouldn't worry too much about lawsuits. In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.


If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

Nice point about the lack of implied permission in copyright. It makes me think robots.txt probably doesn't have any meaning one way or the other legally, but is just a community thing.


> If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

It's more than a theoretical point - that each "serving" of a file is a copy is well established legally. In fact, even loading a program to RAM was considered a copy, per MAI Systems Corp. v. Peak Computer, until Congress made an explicit exception.


And that exception only applies for people doing maintenance on your computer.


> If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

It absolutely could be, and would be, argued. Otherwise an arbitrary library or archive--oh, let's give it a name like Google Books--would have the right to make digital copies of physical books available to the public. Obviously Google tried to do this and (although the case was/is complicated) they weren't allowed to do it unconditionally.

ADDED: Or, heck, any site could declare themselves an archive and offer up ripped CDs to the public.


> I asked, and the answer was: "it's important to fight these fights, which is why I'm donating to the ACLU".

The ACLU and the IA are two different entities; donating to one does nothing to help the other.

> I believe US Code § 108 is relevant here.

Yes, it is.

> There's obviously more to it that I haven't done research on

Glad we got that out of the way.

> but that's a pretty good start and I wouldn't worry too much about lawsuits.

Well, since you're not operating the archive it isn't you that should be worried. And given that 'there is more to it that you haven't done research on', it is probably fair to say that such lack of worry is a bit premature.

> In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.

Because it shows effort on their side to not collect when copyright holders make a minimum effort to warn outside parties not to collect their data.

In the eyes of a judge - or a half decent lawyer - that will go a long way towards establishing that the archive made an effort to stay on the bright side of the line.

Law is interpreted; the fact that there is no such provision in copyright law doesn't mean that a judge isn't able to look past the letter and establish intent. If you are clearly in violation and refuse to do even the minimum in order to avoid such violations, then judges tend to be pretty strict; in other words, they'll throw the book at you. But if you can demonstrate that you did what you could, and that the plaintiff did not make even a minimum effort to warn others that archival storage or crawling is not desired, then their case suddenly is a lot weaker.

See also: the DMCA and various lawsuits in lots of different locations. The internet is far larger than just the USA, and there are a number of interesting cases around this subject in other countries; some of those had outcomes that were quite surprising (at least to non-lawyers).

I copied Geocities.com when it went down and have had quite a bit of discussion with IP lawyers on the subject. So far I've been able to avoid being sued by responding promptly to requests from rights holders. But that doesn't mean they would not have standing to sue me, and if they do I might even lose.

This is not at all a settled area of the law and if you feel that the Internet Archive is in the right here no matter what then you could of course offer to indemnify them from any damage claims.


No help there; the exception is very limited in the number of copies it may produce, among other factors: https://www.law.cornell.edu/uscode/text/17/108


A "copy", in this context, is a file. They can have three of those, which aligns perfectly with standard backup practices. Serving them is distribution, but not copying.


Yes, it is. Even loading a file to RAM was considered a copy (see MAI Systems Corp. v. Peak Computer, Inc.) until Congress made an explicit exception.


What do you mean by sketchy legal environment?

Couldn't they move operations to a non-sketchy one? IIRC they anticipated the need for such a move due to Trump and now have a backup ready in a different country.


I'm not saying the US is sketchy, I'm saying what they do is legally sketchy, considering copyright (which exists in the whole world). Though it's possible that some countries have archival exceptions that would cover them, I don't know.


How does this relate to the robots.txt file? Even being deferential, it doesn't make much sense to respect it.


Upvoted. That's exactly the problem, and the way to solve it.

Example: two months before the movie "The Social Network" was released to theaters in 2010, Facebook decided to add a robots.txt to Facebook.com. Immediately, Archive.org deleted/disabled access to the archive of how the Facebook start page looked from 2004 to 2010.

BTW, the correct way would be to reactivate archive access to Facebook.com for the 2004-2010 time frame. The book "The Accidental Billionaires: The Founding of Facebook" and the film "The Social Network" based on that book of course partly used Archive.org, along with various other research methods, to get the facts.


What about extending robots.txt to include date ranges?

For future domain owners this is likely far too much control, but maybe that could be mitigated if IA tracked DNS/whois/registration info too.
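
Purely as a sketch of the idea: the Disallow-between directive below is made up for illustration and is not part of any existing robots.txt convention.

    User-agent: ia_archiver
    # hypothetical directive: suppress playback only of captures
    # made within this date range
    Disallow: /old-privacy-policy/
    Disallow-between: 2004-01-01/2010-12-31

Captures outside the range would stay publicly viewable, and cross-checking whois/registration history could keep a later owner of the domain from retroactively hiding captures made before they held it.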


If we did this, the date range would be set to "Forever" 99.99% of the time.


Yeah, it's not even a hard problem to solve: just use the archived version of robots.txt that matches the crawl date (see the sketch below).

Too bad they already lost loads of internet content that way.
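
A minimal sketch of that lookup in Python, assuming a site's robots.txt snapshots are available as (timestamp, text) pairs; the function name and data layout are made up for illustration:

    from bisect import bisect_right
    from urllib.robotparser import RobotFileParser

    def robots_for_crawl(snapshots, crawl_time):
        """Return the parsed robots.txt that was live at crawl_time.

        snapshots: list of (timestamp, robots_txt_text), sorted by timestamp.
        Returns None if no robots.txt snapshot predates the crawl.
        """
        timestamps = [ts for ts, _ in snapshots]
        i = bisect_right(timestamps, crawl_time)
        if i == 0:
            return None  # no robots.txt existed yet when the page was crawled
        parser = RobotFileParser()
        parser.parse(snapshots[i - 1][1].splitlines())
        return parser

    # Playback would then honour the rules that applied at crawl time, not
    # whatever the current domain owner publishes today:
    # rules = robots_for_crawl(snapshots, crawl_time)
    # show = rules is None or rules.can_fetch("ia_archiver", url)

That way a robots.txt added in 2017 would have no effect on a page captured in 2005.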


They didn't lose anything. Content excluded in this manner is only made inaccessible to the public, not deleted from the archive. They can change their policy retroactively.


It may be possible technically. I doubt it's possible in reality: they've probably made promises to a bunch of people over the years that this is how it works. People get furious that Their Content is appearing somewhere else and go straight to lawyers on the first email.


> It may be possible technically. I doubt it's possible in reality: they've probably made promises to a bunch of people over the years that this is how it works. People get furious that Their Content is appearing somewhere else and go straight to lawyers on the first email.

They won't be furious when they're dead.

I think the main value of the Internet Archive is not so much in the near term, but in the long term. I hope in the future they enact some policy that ignores any robots.txt for scrapes older than, say, 50 years.


What should happen in the case that a website misconfigures robots.txt and ends up wanting to remove private data?

I think I would be tempted to say that the data can't be removed, to avoid abuse from future domain owners (or current ones), but I'm not sure whether there would be any legal consequences of this attitude.


Provide a content removal form? It works for DMCA notices, it can work here. Maybe even have a 'reason' textbox to see why someone may want content removed...


Of course, there's the inevitable risk that the Internet Archive's newfound control over who is allowed to make their past disappear into the memory hole and who has it archived forever will be used for political ends, especially since the ability to manually archive pages is already used this way by staff. (Take a look at Jason Scott's Twitter or that of the Archive Team sometime - lots of conspicuous manual archiving of stuff that's embarrassing to a certain US political party.)

The issue of curators' views biasing the contents of collections seems to be underappreciated in general in the digital age, for some reason.


Just to idly correct you.

Archive Team (not a part of Internet Archive) actually archives piles and piles of web-based material, sometimes in response to current events, sometimes because of known shutdowns of services, and sometimes because of speculative worry about longevity. (For an example of the last one, we've been archiving all the FTP sites that are still left.)

Meanwhile, Internet Archive's crawlers are bringing in millions (really millions) of URLs every day, just constantly grabbing websites, files, video, you name it.

There's certainly a "bias" toward the current administration, in that 1. they're in power and 2. they keep removing things new and old. But think of it as us having a few lights shined in specific directions while thousands of other floodlights go literally everywhere.


Here I'd lean towards archiving everything indiscriminately. Politicians especially should not have the "right to be forgotten", because what they do is of historical interest.


To clarify the nature of the distortion you are referring to, it would be a sampling bias.

In general, the archive spiders the web and ingests information so that there is a certain mean frequency of visits and a certain likelihood of any particular revision of a web page being captured.

There would be instances in which data was entered into the archive more certainly and more frequently, on the basis of the nature of that data, than otherwise would have occurred.

What one means by bias when one says that this biases the contents of the collection needs to be understood with some care. It would be interesting to hear some historians discuss the matter. I do not think that it is a type of bias that is likely to lead them very far astray.

If it mollifies your concerns any, the last time I checked, anyone could manually archive any web page they liked. However, I would recommend writing to The Archive to express your concern.

I have an entirely partisan appreciation of the ability of The Archive to prevent redactions from the historical record of material that might later be disavowed. However, I share your more general view that there is no reason that the online history of any single major U.S. political party should be documented any less carefully than any other.


Same when an archive crawls illegal/copyrighted data; a pathway for that needs to exist anyway.


Another solution might be including http://archive.org in their archives.


It's turtles all the way down!



