
I think this is more likely to be unintentional. As another comment mentioned, archive.is isn't affected. If you want to remove things from the Internet Archive, you can do so using your robots.txt:

https://archive.org/about/faqs.php#14

https://www.fightcyberstalking.org/how-to-block-your-website...
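For anyone who wants to sanity-check what a robots.txt rule like this would do, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URL are made up for illustration, and ia_archiver is assumed to be the user agent the Internet Archive's crawler matches on. It only shows how a standards-following parser reads the rules, not what the Archive actually does with them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow everything for ia_archiver only,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# example.com is a placeholder URL, not a site from this thread.
ia_blocked = is_blocked(ROBOTS_TXT, "ia_archiver", "https://example.com/post")
other_blocked = is_blocked(ROBOTS_TXT, "SomeOtherBot", "https://example.com/post")
print(ia_blocked, other_blocked)
```

The same is_blocked helper can be pointed at the text of any real robots.txt to see whether it has a rule that applies to ia_archiver at all.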




robots.txt is really only supposed to be used for blocking the Internet Archive's first snapshot, and not to remove existing snapshots – and even this might not be the case in the future as they try to preserve most snapshots. They made a few policy changes last year[1] to how they handle robots.txt files, among other things to handle cases where a domain is sold and a new robots.txt file would result in deleting old data.

[1]: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


Hmm, that may be what it's meant for, but I'm pretty sure it can currently be used to block things retroactively too. IA may still have it in the archive, but won't let viewers view it.

As happened in this case: https://news.ycombinator.com/item?id=16919017

No? The article you linked says they've stopped paying attention to robots.txt for US government and military sites, but it looks like it still retroactively removes visibility for everything else.

I guess IA could change their practices. If Medium or sites like it start actively using robots.txt to try to retroactively remove things from visibility in the archive, perhaps IA will change their practices/policy. I would welcome it.


Interesting. I wasn't aware that it no longer applies retroactively. Even so, medium.com's robots.txt still doesn't try to block new crawling by the Internet Archive's crawler:

https://medium.com/robots.txt

Or via WBM for posterity: https://web.archive.org/web/20180430183503/https://medium.co...

It seems unlikely to me that they would deliberately go to this length to prevent archival, yet not attempt to prevent it happening to begin with. Furthermore, as mentioned in your link, they still accept removal requests via email.


This information is outdated, and ia_archiver now ignores robots.txt (see https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...). You can still get your website removed from the Internet Archive, but you have to contact them.


One thing I learned about archive.org and robots.txt is that they never actually delete anything. If you accidentally block their bots and then unblock them, then in a month or so your old content is available again. I've blocked their bots by mistake a few times, and each time my old web content came back. Not a big deal for me, I just feel silly about my old GeoCities-style sites. :-)
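If you want to check whether old snapshots are visible again, one option is the Wayback Machine's availability endpoint. Below is a rough sketch that builds the query URL and reads the "closest" snapshot out of the JSON reply. The sample response is invented for illustration and the field names are my understanding of the response shape. I'm also assuming, without having verified it, that a snapshot hidden because of robots.txt would simply not be reported as available.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def availability_query(url: str) -> str:
    """Build the availability-check URL for a page (endpoint as I understand it)."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the closest snapshot URL from a JSON reply, or None if nothing is available."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Invented sample reply for a placeholder site; not real data.
SAMPLE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20180101000000",
            "url": "http://web.archive.org/web/20180101000000/http://example.com/",
        }
    }
})

query = availability_query("example.com")
snapshot = closest_snapshot(SAMPLE)
missing = closest_snapshot('{"archived_snapshots": {}}')
print(query, snapshot, missing)
```

Fetching the query URL (for example with urllib.request.urlopen) and passing the body to closest_snapshot would do the real lookup. The network call is left out here so the sketch runs offline.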



