
This is great news in general, but it seems to be done in a clumsy and counterproductive manner that may cause the Internet Archive to be banned from crawling some websites.

The problem: when a website's robots.txt is found to have become more restrictive, the IA retroactively applies the new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep archive. No one outside the IA thinks this is sensible.

Their solution: ignore robots.txt altogether. What? That will just annoy many website operators.

My proposed solution: keep parsing robots.txt on each crawl and obey it progressively, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt says to exclude about_iphone.html, you just stop crawling it; older versions aren't affected.
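
A minimal sketch of what that per-crawl check might look like (Python, using the stdlib robotparser; the URL, user-agent string and archive() call are placeholders, not the IA's actual crawler):

    from urllib import robotparser

    def crawl(page_url, robots_url, user_agent="ia_archiver"):
        # Fetch and parse the robots.txt that is current *now*.
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()

        # Obey it for this crawl only; never re-evaluate old snapshots.
        if rp.can_fetch(user_agent, page_url):
            archive(page_url)  # placeholder for the actual fetch-and-store step
        # If disallowed, simply skip this crawl; existing snapshots stay visible.

    def archive(page_url):
        # placeholder: download page_url and store a dated snapshot
        ...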

Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators.




Archive Team is not associated with the Internet Archive. AT does not crawl the web at large; it only targets specific sites.


There's some value in allowing site operators to retroactively remove content which was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subsequently being indexed and archived when a crawler finds the appropriate directory index.

What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.


Here's a slight modification to the GP proposal:

- Respect robots.txt at the time you crawl it.

- If robots.txt appears later, stop archiving from that date forwards.

- Preserve access to old archived copies of the site by default.

- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic today retrospectively. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.
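
A rough sketch of that serving logic (Python; the snapshot date, the robots.txt first-seen date and the owner-verification flag are all hypothetical fields, not anything archive.org actually stores):

    from datetime import date

    def should_serve(snapshot_date, robots_first_seen, owner_requested_removal):
        # Decide whether an archived snapshot stays publicly visible.
        if owner_requested_removal:
            return False                  # verified owner opt-out wins
        if robots_first_seen is None:
            return True                   # no restriction ever observed
        # Captures made before the restriction appeared stay visible;
        # archiving simply stops from robots_first_seen onwards.
        return snapshot_date < robots_first_seen

    # e.g. a 2012 capture of a site whose restrictive robots.txt
    # first appeared in 2015 remains viewable:
    should_serve(date(2012, 6, 1), date(2015, 3, 1), False)  # True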


>mechanism that allows a proven site owner to explicitly request retrospective access removal. //

It should be "a proven content owner"; just buying a site shouldn't allow someone to remove it from the archive.


How about you respect the robots.txt until the IP address where it is hosted changes? Once the IP has changed, any new robots.txt exclusions apply only to the new pages, not to the archived pages under the old IP, which continue respecting the old archived robots.txt.

The IP address changing is a pretty solid indicator that control of that content has moved to a new organisation. Note this does not always coincide with the domain name owner changing.
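
A hedged sketch of that heuristic (Python; the per-domain record of the last-seen IP is hypothetical, and a real crawler would need to allow for CDNs, round-robin DNS and multi-A-record hosts):

    import socket

    def control_likely_changed(domain, ip_at_last_crawl):
        # Resolve the domain now and compare against the IP recorded at the
        # previous crawl; a mismatch is treated as a rough signal that control
        # of the content may have moved to a new organisation.
        current_ips = {addr[4][0] for addr in socket.getaddrinfo(domain, 80)}
        return ip_at_last_crawl not in current_ips

    # If this returns True, new robots.txt exclusions would apply only to pages
    # crawled from now on; snapshots made under the old IP keep following the
    # robots.txt that was archived alongside them.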

A scenario I can imagine becoming litigious: a company owns a domain for promoting some product and uses robots.txt to prevent copies. The product reaches end of life and the domain is allowed to expire. Someone else buys the domain and starts hosting content with no robots restriction. Archive.org starts to display pages from the old company. The company then sues archive.org for copyright violation.


>may cause the Internet Archive to be banned from crawling some websites.

It looks like Facebook banned ia_archiver (recently? I recall it worked a few weeks ago):

>User-agent: ia_archiver

>Disallow: /

https://www.facebook.com/robots.txt
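
You can confirm it with the stdlib parser (assuming robots.txt is fetchable from wherever you run this):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.facebook.com/robots.txt")
    rp.read()
    print(rp.can_fetch("ia_archiver", "https://www.facebook.com/"))  # False while the Disallow is in place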



