It appears that IA applies (or did apply) a new version of robots.txt to pages already in their index, even if they were archived years ago. That's silly, and stopping that practice would probably solve much of this problem.



I first came across that issue during one of the many Facebook privacy scandals. I'd found some juicy bits in (IIRC) a much earlier version of their privacy policy. But when I went back to it later, the robots.txt had been updated, and the earlier archives had been obliterated.

That just seems wrong.


Next time resave the page on some other service like archive.fo.


Is the only difference between archive.is and archive.fo https?


I believe archive.is is also on https.


I don't think it's "silly". IA operates in a sketchy legal environment. There's no fair use exclusion for what they're doing, and it made sense to be extra careful and deferential towards website operators, lest they get hit by a lawsuit.


> There's no fair use exclusion for what they're doing

Fair use is not the only exception to copyright. US copyright law has a separate section on exceptions for libraries and archives.


There are [1], but they seem to be pretty rooted in making copies of original works in a physical library. The parent's point that the IA operates in a very grey area of law and therefore needs to bend over backwards to comply with requests to remove material still applies.

[1] https://www.law.cornell.edu/uscode/text/17/108


Ask yourself what you would rather have the IA spend its meager funds on: buying hardware and paying people to do critical work, or paying a bunch of lawyers to fight lawsuits against much better funded opponents that they would lose anyway.


I asked, and the answer was: "it's important to fight these fights, which is why I'm donating to the ACLU".

I believe US Code § 108 is relevant here. It starts:

    it is not an infringement of copyright for a library 
    or archives, or any of its employees acting within the
    scope of their employment, to reproduce no more than 
    one copy or phonorecord of a work[...]
There's obviously more to it that I haven't done research on, but that's a pretty good start and I wouldn't worry too much about lawsuits. In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.


If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

Nice point about the lack of implied permission in copyright. It makes me think robots.txt probably doesn't have any meaning one way or the other legally, but is just a community thing.


> If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

It's more than a theoretical point - that each "serving" of a file is a copy is well established legally. In fact, even loading a program to RAM was considered a copy, per MAI Systems Corp. v. Peak Computer, until Congress made an explicit exception.


And that exception only applies for people doing maintenance on your computer.


> If someone wanted to sue the Archive, they would probably argue that every time archive.org serves a file they are making a copy... which is true, after all, if anything reproduced digitally is a "copy" in that sense.

It absolutely could be, and would be, argued. Otherwise an arbitrary library or archive--oh, let's give it a name like Google Books--would have the right to make digital copies of physical books available to the public. Obviously Google tried to do this and (although the case was/is complicated) they weren't allowed to do it unconditionally.

ADDED: Or, heck, any site could declare themselves an archive and offer up ripped CDs to the public.


> I asked, and the answer was: "it's important to fight these fights, which is why I'm donating to the ACLU".

The ACLU and the IA are two different entities; donating to one does nothing to help the other.

> I believe US Code § 108 is relevant here.

Yes, it is.

> There's obviously more to it that I haven't done research on

Glad we got that out of the way.

> but that's a pretty good start and I wouldn't worry too much about lawsuits.

Well, since you're not operating the archive it isn't you that should be worried. And given that 'there is more to it that you haven't done research on', it is probably fair to say that such lack of worry is a bit premature.

> In fact, if they were at risk of lawsuits, I don't see why respecting robots.txt would stop them–there's no "but you didn't tell me not to" excuse in copyright.

Because it shows effort on their side to not collect when copyright holders make a minimum effort to warn outside parties not to collect their data.

In the eyes of a judge - or a half decent lawyer - that will go a long way towards establishing that the archive made an effort to stay on the bright side of the line.

Law is interpreted; the fact that there is no such provision in copyright law doesn't mean that a judge isn't able to look past the letter and establish intent. If you are clearly in violation and refuse to do even the minimum in order to avoid such violations, then judges tend to be pretty strict; in other words, they'll throw the book at you. But if you can demonstrate that you did what you could, and that the plaintiff did not make even a minimum effort to warn others that archival storage or crawling is not desired, then their case suddenly is a lot weaker.

See also: the DMCA and various lawsuits in lots of different locations. The internet is far larger than just the USA, and there are a number of interesting cases around this subject in other countries; some of those had outcomes that were quite surprising (at least to non-lawyers).

I copied Geocities.com when it went down and have had quite a bit of discussion with IP lawyers on the subject. So far I've been able to avoid being sued by responding promptly to requests from rights holders. But that doesn't mean they would not have standing to sue me, and if they do I might even lose.

This is not at all a settled area of the law and if you feel that the Internet Archive is in the right here no matter what then you could of course offer to indemnify them from any damage claims.


No help there; the exception is very limited in the number of copies it may produce, among other factors: https://www.law.cornell.edu/uscode/text/17/108


A "copy", in this context, is a file. They can have three of those, which aligns perfectly with standard backup practices. Serving them is distribution, but not copying.


Yes, it is. Even loading a file to RAM was considered a copy (see MAI Systems Corp. v. Peak Computer, Inc.) until Congress made an explicit exception.


What do you mean by sketchy legal environment?

Couldn't they move operations to a non-sketchy one? IIRC they anticipated the need for such a move due to Trump and now have a backup ready in a different country.


I'm not saying the US is sketchy, I'm saying what they do is legally sketchy, considering copyright (which exists in the whole world). Though it's possible that some countries have archival exceptions that would cover them, I don't know.


How does this relate to the robots.txt file? Even being deferential, it doesn't make much sense to respect it.


Upvoted. That's exactly the problem, and the way to solve it.

Example: two months before the movie "The Social Network" was released to theaters in 2010, Facebook decided to add a robots.txt to Facebook.com. Immediately, Archive.org deleted/disabled access to the archive of how the Facebook start page looked from 2004 to 2010.

BTW, the correct way would be to reactivate archive access to Facebook.com for the 2004-2010 time frame. The book "The Accidental Billionaires: The Founding of Facebook" and the film "The Social Network" based on that book of course partly used Archive.org, along with various other research methods, to get the facts.


What about extending robots.txt to include date ranges?

For future domain owners this is likely far too much control, but maybe that could be mitigated if IA tracked DNS/whois/registration info too.
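
Purely as a sketch of the idea: the Disallow-between directive below is made up for illustration and is not part of any existing robots.txt convention.

    User-agent: ia_archiver
    # hypothetical directive: suppress playback only of captures
    # made within this date range
    Disallow: /old-privacy-policy/
    Disallow-between: 2004-01-01/2010-12-31

Captures outside the range would stay publicly viewable, and cross-checking whois/registration history could keep a later owner of the domain from retroactively hiding captures made before they held it.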


If we did this, the date range would be set to "Forever" 99.99% of the time.


Yeah, it's not even a hard problem to solve: just use the archived version of robots.txt that matches the crawl date (see the sketch below).

Too bad they already lost loads of internet content that way.
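
A minimal sketch of that lookup in Python, assuming a site's robots.txt snapshots are available as (timestamp, text) pairs; the function name and data layout are made up for illustration:

    from bisect import bisect_right
    from urllib.robotparser import RobotFileParser

    def robots_for_crawl(snapshots, crawl_time):
        """Return the parsed robots.txt that was live at crawl_time.

        snapshots: list of (timestamp, robots_txt_text), sorted by timestamp.
        Returns None if no robots.txt snapshot predates the crawl.
        """
        timestamps = [ts for ts, _ in snapshots]
        i = bisect_right(timestamps, crawl_time)
        if i == 0:
            return None  # no robots.txt existed yet when the page was crawled
        parser = RobotFileParser()
        parser.parse(snapshots[i - 1][1].splitlines())
        return parser

    # Playback would then honour the rules that applied at crawl time, not
    # whatever the current domain owner publishes today:
    # rules = robots_for_crawl(snapshots, crawl_time)
    # show = rules is None or rules.can_fetch("ia_archiver", url)

That way a robots.txt added in 2017 would have no effect on a page captured in 2005.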


They didn't lose anything. Content excluded in this manner is only made inaccessible to the public, not deleted from the archive. They can change their policy retroactively.


It may be possible technically. I doubt it's possible in reality: they've probably made promises to a bunch of people over the years that this is how it works. People get furious that Their Content is appearing somewhere else and go straight to lawyers on the first email.


> It may be possible technically. I doubt it's possible in reality: they've probably made promises to a bunch of people over the years that this is how it works. People get furious that Their Content is appearing somewhere else and go straight to lawyers on the first email.

They won't be furious when they're dead.

I think the main value of the Internet Archive is not so much in the near term, but in the long term. I hope in the future they enact some policy that ignores any robots.txt for scrapes older than, say, 50 years.


What should happen in the case that a website misconfigures robots.txt and ends up wanting to remove private data?

I think I would be tempted to say that the data can't be removed, to avoid abuse from future domain owners (or current ones), but I'm not sure whether there would be any legal consequences of this attitude.


Provide a content removal form? It works for DMCA notices, it can work here. Maybe even have a 'reason' textbox to see why someone may want content removed...


Of course, there's the inevitable risk that the Internet Archive's newfound control over who is allowed to make their past disappear into the memory hole and who has it archived forever will be used for political ends, especially since the ability to manually archive pages is already used this way by staff. (Take a look at Jason Scott's Twitter or that of the Archive Team sometime - lots of conspicuous manual archiving of stuff that's embarrassing to a certain US political party.)

The issue of curators' views biasing the contents of collections seems to be underappreciated in general in the digital age, for some reason.


Just to idly correct you.

Archive Team (not a part of Internet Archive) actually archives piles and piles of web-based material, sometimes in response to current events, sometimes because of known shutdowns of services, and sometimes because of speculative worry about longevity. (For an example of the last one, we've been archiving all the FTP sites that are still left.)

Meanwhile, Internet Archive's crawlers are bringing in millions (really millions) of URLs every day, just constantly grabbing websites, files, video, you name it.

There's certainly a "bias" toward the current administration, in that 1. they're in power and 2. they keep removing things new and old. But think of it as us having a few lights shined in specific directions while thousands of other floodlights go literally everywhere.


Here I'd lean towards archiving everything indiscriminately. Politicians especially should not have the "right to be forgotten", because what they do is of historical interest.


To clarify the nature of the distortion you are referring to, it would be a sampling bias.

In general, the archive spiders the web and ingests information so that there is a certain mean frequency of visits and a certain likelihood of any particular revision of a web page being captured.

There would be instances in which data was entered into the archive more certainly and more frequently, on the basis of the nature of that data, than otherwise would have occurred.

What one means by bias when one says that this biases the contents of the collection needs to be understood with some care. It would be interesting to hear some historians discuss the matter. I do not think that it is a type of bias that is likely to lead them very far astray.

If it mollifies your concerns any, the last time I checked, anyone could manually archive any web page they liked. However, I would recommend writing to The Archive to express your concern.

I have an entirely partisan appreciation of the ability of The Archive to prevent redactions from the historical record of material that might later be disavowed. However, I share your more general view that there is no reason that the online history of any single major U.S. political party should be documented any less carefully than any other.


Same when an archive crawls illegal/copyrighted data; a pathway for that needs to exist anyway.


Another solution might be including http://archive.org in their archives.


It's turtles all the way down!



