Hacker News

Do they no longer have the program they used to, where researchers could apply for direct access to the crawl data?



Are you thinking of http://commoncrawl.org, or is/was there actually an Internet Archive program like that? Because that would be amazing.


Funnily enough, the only way I found info on this now was to go back through the Wayback Machine to find old versions of archive.org...

Here's a page with some tantalizing information, but I'm gathering from the lack of current info that maybe this access is a thing of the past:

https://web-beta.archive.org/web/20060209225202/http://www.a...

Clicking through to an item on the sidebar, it's clear these were actual UNIX logins made available to researchers with approved projects:

"Research.archive.org houses the personal files of the users on the system. Each user has access to the directory /home/<login> for file storage. Since research.archive.org is NFS mounted on all of the hosts, a user's home directory <blah blah blah>...

...Individual hosts can be accessed using the remote shell (rsh) UNIX command. The hosts in the cluster have an auto-authenticating script, so the secure shell (ssh) command is unnecessary. Access to the hosts is limited depending on the type of user account that is held. User accounts directly on research.archive.org have access to..."


It was not Common Crawl. I applied for access at one time, but I ended up dropping the ball because I got distracted by other projects, so I never got approved.

As for any relation to Common Crawl, I imagine there are good reasons, both pro and con, for them to have donated some data to Common Crawl, but that's just speculation. My guess is they've got too much on their plate to swing yet another project like that.

You could try reaching out; maybe they still have a quiet research program, or maybe one for donors, if that's an option.



