One of the major reasons to make something like this available is not research but to get the eggs in more than one basket. If archive.org ever went offline permanently that would be a pretty big disaster.
A backup copy of the library of Alexandria would have been a nice to have before it burned down but would have been priceless afterwards.
So please, make all of the archive available in some form. It will be an insane amount of data but at least there will be some institutions that will be able to insure this precious resource against various disasters.
Some websites start blocking archive.org through robots.txt once new owners acquire the old domains. It would be good if there were an elegant solution to this problem.
The internet is flooded with pop culture bullshit. The Library contained precious works from some of the greatest geniuses in history--works that we know exist but were forever lost.
On the other hand, it is still literally true that the internet also contains (although minuscule in proportion) precious works from some of the greatest geniuses in history, that you will not finish reading in your lifetime.
My point is that it's not like there's any shortage of great reading materials. Unless you want some specific materials from the library of Alexandria, it does not make much sense to miss it.
There are still people who care relatively little for the study of the past. The internet is also filled with lots of things that aren't "pop culture bullshit".
After I look at what careful study of great geniuses of the past and high culture did for us in the 1940s and ever since, forgetting the bulk of it (less technical stuff) is probably best. Most of the (intellectual) past is an albatross.
I don't know that we necessarily miss it for the information that was in it, but I think we certainly miss it for the progress we lost without it.
Electricity was discovered and even used in the ancient middle east. Steam powered perpetual motion devices were constructed, but never applied to locomotion.
Can you imagine where we would be now as a species if ideas like these were allowed to propagate across the Mediterranean thousands of years ago? Steam powered devices are only 350+ years old, and your grandpa's grandpa probably didn't have electricity in his house.
In what way was electricity used? (I support your overall point that better preservation would've sped progress, and wish to know if there's something I didn't know.)
I find it hard to believe I'm reading this comment on Hacker News. If there is any group of people I would expect to estimate the value of such a trove of knowledge about the past at close to its true value, it's this one.
CommonCrawl also has a fairly large ("The crawl currently covers 5 billion pages") dataset of this sort, which unlike the one from archive.org is already available to everyone on S3 under the requester-pays model.
'If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.'
This is annoying, that they're using the enterprise sales model for distribution. Just put it on S3.
"Just?" That's $9000/month for standard storage on S3 - $7000 for reduced redundancy. Each download would cost them about $4000 in bandwidth fees.
They already know how to store massive amounts of data, and how to send it over the network. Assuming $100/TB for their own media means it would only cost them about $4000 to store it themselves.
Assuming you have a 1 Gb/s connection, that would take you over 7 days to download. It's probably both cheaper and faster to write the data to disk and ship the disk than to do an S3 download.
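As a rough check on that download-time estimate, here is a minimal Python sketch. It assumes the roughly 80 TB dataset size mentioned elsewhere in this thread, decimal units (1 TB = 10^12 bytes), and a fully saturated link with no protocol overhead; the function name `transfer_days` is purely illustrative.

```python
def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move size_tb terabytes over a link_gbps link.

    Assumes decimal units and 100% link utilization (an optimistic bound).
    """
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)     # gigabits/s -> bits/s
    return seconds / 86_400                # seconds per day


if __name__ == "__main__":
    # 80 TB is the size quoted elsewhere in the thread (assumption here).
    print(f"{transfer_days(80, 1):.1f} days at 1 Gb/s")
```

Real transfers would be slower than this bound, which only strengthens the case for shipping disks.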
It reads more like they don't know if or how people want to use this. (The "are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk.") Simply making the data available doesn't give them feedback.
For example, is it sufficiently worthwhile for them to go through the effort of providing the data on S3, given the costs?
Talking with Amazon to make a special deal for hosting that data would not be a "just." The major point remains - archive.org knows how to host and provide large files, so the issue must be some other factor. I think they want to know if it's worthwhile to do so.
It'd be a HELL of a lot cheaper and a MUCH faster transfer rate to mail a few really large capacity hard drives full of the data instead of hosting it on S3.
edit: Just saw dalke's response. Great minds think alike!
Let me assure you, storage has gotten cheap but not that cheap. ;)
You've omitted the labor cost to assemble, debug and maintain your MacGyver device. That's easily another $2500/mo (amortized).
Secondly you don't really want to store 80T on the cheapest components you can possibly get without a lot of testing and planning. This $22 PSU, trust me, it will come back to haunt you.
Thirdly, "decent redundancy" starts at factor 2.5, not 1.25.
And finally: If you want to put this stuff online and have people actually download it then you'll soon notice that redundancy is not only needed for availability but also for performance.
A reasonable ballpark figure for low-end networked storage nowadays is $0.05/GB per month (it gets much cheaper above 500T). Thus hosting those 80T should cost roughly $4000/mo, give or take a few.
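The ballpark arithmetic above can be sketched in a few lines of Python. This is only an illustration of the quoted figures ($0.05/GB per month, 80T), assuming decimal units (1 TB = 1000 GB); the function name `monthly_cost_usd` is hypothetical.

```python
def monthly_cost_usd(size_tb: float, usd_per_gb_month: float) -> float:
    """Monthly hosting cost for size_tb terabytes at a flat per-GB rate.

    Assumes decimal units (1 TB = 1000 GB) and no volume discount.
    """
    return size_tb * 1000 * usd_per_gb_month


if __name__ == "__main__":
    # Figures quoted in the comment above: 80T at $0.05/GB per month.
    print(f"${monthly_cost_usd(80, 0.05):,.0f} per month")
```

A flat rate ignores the volume discount the comment mentions for larger deployments, so treat the result as an upper-end estimate for that pricing.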
I'd be doing this myself, so I'll charge myself $0.
> Secondly you don't really want to store 80T on the cheapest components you can possibly get without a lot of testing and planning. This $22 PSU, trust me, it will come back to haunt you.
Sure. Of course. Bump that to $45. Ok, another $200. Not huge.
> Thirdly, "decent redundancy" starts at factor 2.5, not 1.25.
If you are serving it to the internet at large. But for personal use, 1.25 is fine unless you are saying the proper RAID setup is Number of disks * 2.5; which would be something new to me, for sure.
> And finally: If you want to put this stuff online and have people actually download it then you'll soon notice that redundancy is not only needed for availability but also for performance.
I don't. The presumption is that it's a copy (my copy, actually), not the original.
> A reasonable ballpark figure for low-end networked storage nowadays is $0.05/GB per month (it gets much cheaper above 500T). Thus hosting those 80T should cost roughly $4000/mo, give or take a few.
You might be getting ripped off :-(.
I can get a half-rack (that's 22U) for $900/month. Even at 2.5 redundancy and if I had to pay for the patch and the switch, it's still way under $4000/month.
Besides, the thought experiment was to run it from somewhere like my entry-way, near my coat-hanger: "What's this? Oh, it's just the internet; the Whole Internet. No no, just a copy."
Yes, of course you can cobble something together when availability does not matter at all (it might blow the fuse in your apt, though;)).
I was just saying that in an application with most basic availability-requirements you're not getting the cost down like that.
I.e. even though you could fit that into one rack, nobody actually would (redundancy is measured in powers of >=2). And even though you might find an ISP who won't bitch about you drawing >10 Amps in "half a rack" (cough), you should still be a little concerned about other tenants screwing around in the same rack as your only copy of 80T of data that you care about... ;)
I don't see why you would use 20 machines. I think a good place to start would be RAID 5 using 6 drives, so 5 for data and 1 for parity. That gives 6 machines, assuming you're dealing with uncompressed data; if it's mostly text, you can probably use 2 or 3.
How did I screw that up ... woops, let me change it.
Ok done.
I still argue 7 machines though. I mean, sure you could do USB + enclosure or have some more expensive board (with 6 SATA connectors, I don't know how cheap those go). Then you also may need more power, depending on how you use the thing. It's true that fewer machines generally = fewer faults, just as a matter of statistics.
But in reality, in practice, the users will probably have some IBM or SGI solution that is a full-height rack with a bunch of SAS drives or something. I'm sure you've seen those things at trade shows.
But my point here was to try to determine how much it would cost with total baseline OTS hardware.