Internet Archaeology: Scraping time series data from Archive.org (sangaline.com)
66 points by foob on April 5, 2017 | 7 comments



Really cool, congrats!

I have built something similar, but to retrieve a backup for one of my dead websites. It was a fun project.

Shameless plug: https://github.com/hartator/wayback-machine-downloader/
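
For anyone curious how a tool like this can enumerate what's archived for a site, here is a minimal sketch (in Python with requests, purely for illustration; the downloader above is its own Ruby CLI) that lists captures of a domain through the Wayback Machine's CDX index:

    import requests

    CDX_API = "http://web.archive.org/cdx/search/cdx"  # Wayback Machine CDX index

    def list_snapshots(domain, limit=25):
        """Return archived captures of `domain` as dicts keyed by the CDX fields
        (urlkey, timestamp, original, mimetype, statuscode, digest, length)."""
        params = {"url": domain, "output": "json", "limit": limit}
        rows = requests.get(CDX_API, params=params).json()
        if not rows:
            return []
        header, captures = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in captures]

    for capture in list_snapshots("example.com"):
        # Each capture can be fetched at https://web.archive.org/web/<timestamp>/<original>
        print(capture["timestamp"], capture["original"])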


Do they no longer have the program they used to, where researchers could apply for direct access to the crawl data?


Are you thinking of http://commoncrawl.org, or is/was there actually an Internet Archive program like that? Because that would be amazing.


Funnily enough, the only way I found info on this now was to go back through the Wayback Machine to find old versions of archive.org...

Here's a page with some tantalizing information, but I'm gathering from the lack of current info that maybe this access is a thing of the past:

https://web-beta.archive.org/web/20060209225202/http://www.a...

Clicking through to an item on the sidebar, it's clear these were actual UNIX logins made available to researchers with approved projects:

"Research.archive.org houses the personal files of the users on the system. Each user has access to the directory /home/<login> for file storage. Since research.archive.org is NFS mounted on all of the hosts, a user's home directory <blah blah blah>...

...Individual hosts can be accessed using the remote shell (rsh) UNIX command. The hosts in the cluster have an auto-authenticating script, so the secure shell (ssh) command is unnecessary. Access to the hosts is limited depending on the type of user account that is held. User accounts directly on research.archive.org have access to..."


It was not Common Crawl. I applied for access at one time, but I ended up dropping the ball because I got distracted by other projects and never got approved.

As for any relation to Common Crawl, I imagine there are good arguments both for and against them donating some data to Common Crawl, but that's just speculation. My guess is they've got too much on their plate to take on yet another project like that.

You could try reaching out... maybe they still have a quiet research program, or one for donors, if that's an option.


So what is the Internet Archive's policy on this level of scraping? Do they have a rate limit in place?


Yes, they start sending 429 (Too Many Requests) responses if you don't use appropriate delays. They also provide a public API [0], which I believe is intended for automated requests of this type (as opposed to crawling the Wayback Machine website directly).

[0] - https://archive.org/help/wayback_api.php
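
For concreteness, here is a minimal sketch (in Python with requests; the retry count and delays are arbitrary placeholders, not documented limits) of querying the availability endpoint described in [0] while backing off on 429 responses:

    import time
    import requests

    WAYBACK_API = "https://archive.org/wayback/available"

    def closest_snapshot(url, timestamp=None, delay=1.0, max_retries=5):
        """Ask for the snapshot of `url` closest to `timestamp`, retrying with a
        growing delay whenever a 429 (Too Many Requests) comes back."""
        params = {"url": url}
        if timestamp:
            params["timestamp"] = timestamp  # e.g. "20060101"
        for attempt in range(max_retries):
            response = requests.get(WAYBACK_API, params=params)
            if response.status_code == 429:
                time.sleep(delay * (attempt + 1))  # rate limited: wait and retry
                continue
            response.raise_for_status()
            snapshots = response.json().get("archived_snapshots", {})
            return snapshots.get("closest")  # None if nothing is archived
        raise RuntimeError("still rate limited after %d retries" % max_retries)

    print(closest_snapshot("example.com", timestamp="20060101"))

Keeping a sleep between consecutive calls, on top of the retry logic, is the simplest way to stay under whatever limit they enforce.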



