Ask HN: Does anyone have a backup of Aaron Swartz' site: theinfo.org?
141 points by skoocda on Dec 7, 2021 | 9 comments
Hi HN folks,

I was browsing Aaron's blog [http://www.aaronsw.com/weblog/rawnerve] when I noticed the footer links out to http://theinfo.org/ (last updated March 2008). However, none of the links are still alive, and I couldn't find any pages that were backed up on Archive.org.

I figured this would be the best place to ask - does anyone have an archive of these pages?

Holy hell. It's incredibly inspiring what aaronsw was able to do. It's a list of links, and I kept scrolling expecting to run into text. Nope. Just a gigantic list of links, and choosing a random one seems to result in actual data: https://web.archive.org/web/20090410081645/http://www.rdfabo...

Well, almost actual data. The 4.7MB tarball link is broken, but the author notes:

> For the detailed Census statistics, you'll have to download the raw Census data files from the Census Bureau, my Perl script and the patch file below and run it yourself because the files are too big for me to offer as a download!

And the perl script link still works. From experience, I know how valuable that can be. The only reason I was able to help build The Pile (books3) is because of aaronsw's html2text script (https://github.com/aaronsw/html2text). Out of at least four conversion options I tried, that script was the only one that was flexible enough to be modified to spit out human-readable text at scale.

Thanks for the nostalgia. 2008 doesn't feel like yesterday anymore, but it will always feel incredibly special. That era was just different, and it's a shame that one only realizes it in hindsight. Otherwise I would've slowed down just to look around more and appreciate getting a glimpse.

P.S. Although most of the data download links are dead, there is a trick to recover some of them: try the links on the live site. archive.org doesn't always snag tarballs, but occasionally (very occasionally) the original files are still online today.

You should also check the parent site itself, not just the direct url to the tarball. Sometimes you'll get lucky and the site merely went through a reshuffle rather than being taken offline.
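
For anyone who wants to script that trick, here's a rough Python sketch. It only does the URL bookkeeping: given a Wayback snapshot link, it pulls out the original URL and lists the parent directories and site root worth trying on the live web. The name `live_candidates` is my own, the example URL is made up (example.com, not one of the real links), and it assumes the original URL inside the snapshot link includes its scheme.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Wayback snapshot links look like:
#   https://web.archive.org/web/<timestamp>[modifier]/<original url>
_WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/(\d{1,14})(?:[a-z]{2}_)?/(.+)$"
)

def live_candidates(snapshot_url):
    """Return live-web URLs to try: the original first, then each parent."""
    m = _WAYBACK.match(snapshot_url)
    if m is None:
        raise ValueError("not a Wayback snapshot URL: %r" % snapshot_url)
    original = m.group(2)
    parts = urlsplit(original)
    candidates = [original]
    segments = [s for s in parts.path.split("/") if s]
    # Walk up one directory at a time until we reach the site root.
    while segments:
        segments.pop()
        path = "/" + "/".join(segments) + ("/" if segments else "")
        candidates.append(
            urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        )
    return candidates

if __name__ == "__main__":
    # Hypothetical snapshot link, for illustration only.
    demo = ("https://web.archive.org/web/20090410081645/"
            "http://example.com/data/census/all.tar.gz")
    for url in live_candidates(demo):
        print(url)
```

You'd still have to request each candidate yourself (browser, curl, whatever); I left the network part out on purpose.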

Both the Wayback Machine and Wikipedia asked me for a donation this morning, and I made a small one to each. Well worth the few dollars.

Wait, he actually used CKAN, and now the US government is also using it!? How horrifically ironic.

Working backwards:

Here's a snapshot from October 2021: https://web.archive.org/web/20211020224005/http://theinfo.or...

The updated links in it (to http://theinfo.anandology.com/) seem dead though.

In 2014 there are error messages, and sometimes just court records: https://web.archive.org/web/20140215110438/http://theinfo.or...

The latest I can find of what looks like the original site is 2012: https://web.archive.org/web/20121003205028/http://theinfo.or...

In the next snapshot, it's replaced with the court records.
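
If you'd rather not click through the calendar by hand, the Wayback Machine has an availability endpoint (https://archive.org/wayback/available) that reports the snapshot closest to a given timestamp. Here's a minimal Python sketch that just builds the query URL and decodes the 14-digit timestamps that appear in snapshot links like the ones above. The helper names are mine, the HTTP request itself is left out, and treating the timestamps as UTC is my understanding rather than something stated in this thread.

```python
from datetime import datetime
from urllib.parse import urlencode

def availability_query(url, timestamp):
    """Build a Wayback availability query for the snapshot nearest `timestamp`.

    `timestamp` is a digit string such as "20121003" (YYYYMMDD) or a full
    14-digit YYYYMMDDhhmmss value.
    """
    params = urlencode({"url": url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + params

def snapshot_time(timestamp):
    """Decode a 14-digit snapshot timestamp into a naive datetime.

    Assumption: the value is UTC; no timezone is attached here.
    """
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")

if __name__ == "__main__":
    print(availability_query("theinfo.org", "20121003"))
    print(snapshot_time("20121003205028"))
```

Fetching the query URL should give back a small JSON document describing the closest snapshot, if there is one.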

I guess that was my problem: I was looking up the links via anandology.com, assuming that was where they'd be archived.


Let's not forget jottit.com, Aaron's site for taking notes. It was useful back in the day.

Another place to ask these types of questions is on /r/datahoarder.

