The Internet Archive is absolutely essential to Wikipedia, whose articles are required to be verifiable to reliable sources. When pages go offline, the link rot makes it harder for articles to be verified. However, if the Wayback Machine has an archived copy, that copy can then be cited as the source and made available to readers and editors. The Wayback Machine now automatically archives every new external link added to the English Wikipedia.
The Wikimedia Foundation's budget is about 10 times that of the Internet Archive. If you see the fundraising banner on Wikipedia and want to help out the site, but don't think the Wikimedia Foundation needs the donation, consider donating to the Internet Archive instead.
> The Wayback Machine now automatically archives every new external link added to the English Wikipedia.
Links submitted to Hacker News should be auto-archived, too: I have often stumbled upon dead links [0] that had otherwise generated insightful discussion on news.yc. Adding an "archive" link next to "web" would work nicely.
A third party could do this if they wished. Savepagenow is one of several wrappers that can save an arbitrary URL's contents to the Internet Archive: https://github.com/pastpages/savepagenow
> A mechanism to save pages to the web archive has now been implemented. Just need to ping this URL: http://web.archive.org/save/<insert URL you want to save>.
It's been on our list to do that for some time, but this should make it much easier. Thanks!
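For anyone wiring this up, the save endpoint quoted above can be sketched in a few lines of Python. This is a minimal sketch, not the savepagenow API: `build_save_url` and `request_snapshot` are hypothetical helper names, and the only thing taken from the thread is the `http://web.archive.org/save/<URL>` pattern. Rate limits, authentication and the exact response are not covered here and should be checked against the Archive's own documentation.

```python
from urllib.request import Request, urlopen

# Endpoint pattern as quoted in the thread.
SAVE_PREFIX = "http://web.archive.org/save/"


def build_save_url(target_url: str) -> str:
    """Return the 'save this page' URL for target_url (hypothetical helper)."""
    target_url = target_url.strip()
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return SAVE_PREFIX + target_url


def request_snapshot(target_url: str, timeout: float = 30.0) -> int:
    """Ping the save URL and return the HTTP status code (hypothetical helper).

    This performs a real network request. Treating the status code as the
    only interesting part of the response is an assumption of this sketch.
    """
    req = Request(
        build_save_url(target_url),
        headers={"User-Agent": "archive-sketch/0.1"},  # placeholder value
    )
    with urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Only builds the URL; call request_snapshot() to actually submit it.
    print(build_save_url("https://example.com/some/page"))
```

A site like HN would presumably call something like `request_snapshot` from a background job at submission time, so a slow or failing archive request never blocks the submission itself.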
It's been awesome to see archiving and Archive links become standard practice on Wikipedia. For a long time, links just rotted and were replaced. Then later, dead links would be switched to Archive content if it was available. But it's so much easier and more reliable to archive a page when it's referenced, and provide both links even while the page is still up.
It's one thing to automate dead link detection, but archiving defends against a lot more. Universities seem to redo their sitemaps and redirect all their old links about once a year. And reference pages get edited, which goes unnoticed for a long time and then leads to frustrating [Not in citation given] tags. It's incredible what a boon the Internet Archive has been to wiki sourcing.
Fantastic point. I personally would much rather support the Internet Archive than WF.
Also, there is a 2:1 donor match happening now, making the value of each donation effectively triple. Moreover, if your company matches 501(c) donations (mine does, at least up to something like $20k a year), you can have them match your donation too, effectively giving the Archive 4x your original donation.
> Moreover, if your company matches 501(c) donations (mine does, at least up to something like $20k a year), you can have them match your donation too, effectively giving the Archive 4x your original donation.
Wouldn't it be 6x, assuming your company could have their donation tripled as well?
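The 4x versus 6x question comes down to whether the employer's gift itself qualifies for the 2:1 match. Here is a rough sketch of the arithmetic. The function name and parameters are made up for illustration, and whether an employer gift is actually eligible for the match is an assumption the thread does not settle.

```python
def total_to_archive(donation: float,
                     employer_match: bool = False,
                     employer_gift_also_matched: bool = False,
                     match_ratio: int = 2) -> float:
    """Total the Archive receives for one personal donation.

    match_ratio=2 models the 2:1 donor match from the thread: every
    eligible dollar brings in two more, so it counts three times.
    """
    multiplier = 1 + match_ratio           # 2:1 match -> 3x
    total = donation * multiplier          # your gift plus the donor match
    if employer_match:
        employer_gift = donation           # employer matches you 1:1
        # Assumption: the employer gift may or may not be matched 2:1 too.
        total += employer_gift * (multiplier if employer_gift_also_matched else 1)
    return total


print(total_to_archive(100))                                    # 300
print(total_to_archive(100, employer_match=True))               # 400
print(total_to_archive(100, employer_match=True,
                       employer_gift_also_matched=True))        # 600
```

So both numbers are right under different assumptions: 4x if only your own gift is matched 2:1, 6x if the employer's gift is matched as well.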
If the matching helps persuade people to donate, I'm all for it.
The Internet Archive is a world wide treasure. Please donate!
And it's really easy to set up an automatic monthly donation, as I have done. One-time donations are great. Ongoing income helps ensure that this precious resource can continue to fulfill its mission.
If the donation is contingent on someone else donating first, semantics aside, it’s the same thing. My employer wouldn’t donate to the Internet Archive in the amount that I gave unless I did first.
Earlier this year, I created a simple "Donate to the Internet Archive" logo for a website and e-book that I compiled from 19th- and early 20th-century books archived by IA. The logo links to the above IA donation page. Perhaps others would like to add the logo and link to their sites:
It is immensely important, especially as there is a strong trend to hide content behind paywalls and registration pages. Even if something is accessible today...
And there are always forces with an interest in making some inconvenient data just disappear.
I've discovered their efforts to archive old DOS games... fully playable in the browser through a DOSBox build in WASM, as far as I know. That's a really impressive cooperation of very old and very new technology - 16-bit up to the edge of JS (- edit: or rather, the edge of browser-based computing).
And on that train of thought, I just had a little flashback about storage sizes. There was some time when 3.14 MB was a big unit of measurement, and some 50MB drive was huge. The classical example: Monkey Island, Indiana Jones or Day of the Tentacle on ~12 floppy disks. But you had to choose which 1 or 2 to install because you didn't have enough space on your hard drive, or you had to swap disks every few screens :)
Or, they are archiving a lot of video playthroughs in the let's-play style from video sites, in case those sites go through a meltdown like Viddler's.
I guess I'm rambling. Point is: This isn't just a storage dump. There are also interesting projects around the Internet Archive to make the old things accessible on new systems. Very worthwhile donating to.
I don't get why you are being downvoted. Are people not familiar with the expression "get out of here" used in a friendly manner? Or do they really downvote you because they think it is so bad that you think that 1kB should be equal to 1000 bytes? I mean, I don't agree with you either but IMO the downvote should only be used for things that are either factually wrong, or which detract strongly from the conversation, or which contribute nothing at all. I think your comment is nice and it does not deserve to be downvoted.
Yeah, apparently downvotes can be used as a "disagree" signal. Usually when I see a comment being unfairly downvoted that way, I upvote as a counter-balance, even if I disagree with the argument (i.e., how many bytes in 1kB).
2.88MB was supported by most drives and was usable under Linux. You were supposed to use a proper 2.88MB floppy, but you could usually get away with using a 1.44MB floppy and reformatting.
I'm not entirely sure how well they were supported under Windows though.
IIRC the Windows 95 install floppies were 2.88MB. With that in mind it's probably safe to assume that they were at least readable even in DOS, although I don't recall being able to format or write to a floppy at that capacity in any version of DOS/Windows.
A few months back I authoritatively stated that floppies allowed for 3.14MB, understanding my mistake only a few minutes later and being thoroughly ashamed. Now I know there's some reason behind it, not just a random glitch of memory.
There is an interesting podcast with a "Chuck" Somerville interview. He started on the Apple ][. It's a pretty fun walk through the industry in the 80s, and touches on getting the "Chip's Challenge" name back from the group that bought it.
Most of that effort to archive games is not IA's. It comes from two other groups who sometimes work with IA when they are not all fighting. One group goes through the games and configures them to work correctly in DOSBox. Another just catalogs them. Getting the metadata on that stuff is kind of tough now. What was the exact date a game came out? Does it still have artwork? Is there some odd protection scheme going on? Is there an IMG file for the disks, or is it just some random pile of files? Can you get a copy from eBay? Etc. IA typically notes who is doing the archive work. IA also respects takedown notices for these games, as many of them sort of came back from the dead and are sold again, or the company just does not want its IP on some random site. For a good portion, though, no one really knows who owns them anymore. Some have clear lineage. Some have been sold over and over to random companies and, depending on the contracts, no one knows.
That 'scene' is also full of drama and very insular. Donate to IA to help them build better infrastructure and better search. But donating because they host old abandoned DOS games is not what I would call a good reason; it is misplaced, because the people doing the majority of the work are doing it because they like it, not for money. IA does do some work around that, such as getting DOSBox and MAME running in a browser. Just be clear on what you are donating to.
I encourage everyone to also do their own thinking about what they enjoyed on the internet or in computer software in the 80's, 90's and 00's, and to see if it is still available somewhere.
I found out that the local games scene I used to love as a kid had been almost wiped off the face of the earth. These were small non-commercial and shareware games localized in just one language, so it was already a niche. The free hosting services of the 90s are gone, so those sites are down, and no one wanted to keep paying for hosting for 20+ years on sites that get very few visitors nowadays.
The only way to get these games again was to find a Discord group and a friendly stranger who agreed to seed a torrent (which had 0 seeds when I found it). I'm looking to upload them to a couple of different places and compile a basic website catalogue (static site on CDN) one of these days. For the layperson, these games are already gone from the internet.
The Internet Archive does a great service but it is breadth-first and quite surface level. The depth has to come from people who were familiar with the sites at their peak. And there's a big chance that no one is doing that for your specific interest.
BlueMaxima's Flashpoint is a fabulous archival project that is saving as many Adobe Flash games/animations as they can before browsers pull support at the end of 2020. Really cool, since Flash games are a similarly concrete slice of culture/history that will just be gone if they're not archived.
Torrents can be seeded from HTTP sources, known as a WebSeed. I've done it with Dropbox and they don't care as long as they don't get a copyright infringement notice. Easy way to use free storage as a CDN.
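For the curious, here is roughly what a WebSeed looks like inside the .torrent file. As far as I know, the common HTTP web seeding scheme (BEP 19) just adds a top-level `url-list` key to the bencoded metainfo. The sketch below uses a tiny hand-written bencoder rather than any real torrent library, and the tracker URL, file name and piece data are placeholders, not a working torrent.

```python
def bencode(value) -> bytes:
    """Minimal bencoder for ints, bytes, str, lists and dicts."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Bencoded dictionaries must have their keys sorted as raw bytes.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def add_web_seeds(metainfo: dict, urls: list) -> dict:
    """Return a copy of metainfo with HTTP sources listed under 'url-list'."""
    updated = dict(metainfo)
    updated["url-list"] = list(urls)
    return updated


# Placeholder metainfo: a real torrent needs real SHA-1 piece hashes.
metainfo = {
    "announce": "http://tracker.example/announce",   # hypothetical tracker
    "info": {"name": "games.zip", "length": 0,
             "piece length": 262144, "pieces": b""},
}
seeded = add_web_seeds(metainfo, ["https://example.com/files/games.zip"])
torrent_bytes = bencode(seeded)
```

Clients that support web seeding can then fetch pieces over plain HTTP from the listed URL when no peers are around, which is what makes a 0-seed torrent usable again.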
YouTube has many channels with playthroughs of old games as well. (Old for me is the 90's and early 00's since I'm in my twenties.) You can find things like the old Lego PC games, old educational games like Physicus/Bioscopia, kids' games like Putt-Putt (not least because they have a speedrunning community), and much more. It's been a great resource to go back to the stuff I played growing up.
Although efforts like the Internet Archive are noble (and I find it occasionally useful), I'm not sure it's always so great that everything anyone does online will be permanently archived.
I know many people feel that everything should be available forever. But for me... it's pushing me away from doing much on the web. I liked it in the 90s when things were more ephemeral. When you could make mistakes and not have them easily found by anyone with a few clicks, forever.
You're right, let's burn the library down because one book has a libelous chapter in it.
This argument is so horrible as to be actively harmful to Archive's work. Jason Scott is a god, and if we didn't have him, we'd have to invent him.
WE DO NOT GET TO CHOOSE WHAT THE FUTURE FINDS INTERESTING.
We live in the only point in human history where we can actually save all of humanity's knowledge and culture, and we can do so without having to worry about physical space or staff to work the "library." It's a remarkable time we live in, and yet, 99% of our society either doesn't care, thinks this work is stupid, or actively works against it through horrific copyright laws.
We know more about how Rembrandt painted and lived than we do about how Atari 2600 programmers worked and lived. I can go to Rembrandt's house and see where he lived, where he painted, how he worked, where he slept and ate and mixed his paints and taught his classes.
Atari's old HQ is just another office building. The source code to those games is mostly gone (thankfully, it's assembly and easier to disassemble). We need to save our culture and digital heritage, else we forget where we come from.
Deleting some old tweets is one thing, but actively worrying about Archive's work is just harmful to us all. We need 10,000 more Archives, dammit. It's supremely important work that is helping stem the tide of lost culture due to stock market forces. Geocities is gone forever because Yahoo! didn't find it profitable. This cannot keep happening.
I’m not convinced it’s dangerous to explore whether there are benefits to ephemerality.
I’m also not sure your Rembrandt example shows what you suggest it does. The average Atari 2600 programmer would be more equivalent to the hundreds of now unknown artists in Rembrandt’s time. The John Carmacks of today will be remembered in detail with or without blanket archive efforts.
Maybe, just maybe, Rembrandt’s status in our minds is a result of generations of people each seeing the individual value in his work. That is, each generation does indeed get to decide what future generations remember. Or at least it used to be true until the digital age.
Maybe the change is an improvement. But maybe not.
And libraries are the epitome of what you’re fighting against. They are by definition works chosen by humans based on judgment calls of their perceived value.
Let’s at least acknowledge that blanket archive efforts are a fundamental change in themselves and a departure from the human status quo for thousands of years. Then let’s debate whether the change is an unabated good.
While I don't endorse your parent's over-the-top rhetoric, and I do agree that there is value in ephemerality and that it's worth noting that libraries are more carefully curated than a dump-and-archive, I think it's also worth noting that these are generally public pages.
All the stuff Tumblr users intentionally wrote and published publicly, but none of their IP address logs and other incidentally collected information, is exactly what ought to be archived and preserved, in my opinion. This is in strong contrast to incidentally collected data including clear PII like IP addresses that many companies today are hoarding forever, when they ought to be ephemeral.
Tumblr blogs often include people's names, faces, and/or details about their personal lives. That's very much personally identifiable information! And while they did post it publicly, they likely didn't do so with the intention of it being saved forever in a publicly available, easily searchable archive. This especially applies for porn blogs where people post their own original content.
There's certainly value in archiving social media but I think it has to be balanced against the harms, instead of defending the practice with literal religious fervor and dismissing all criticism out of hand.
Was there something in particular I said that you felt was defending it with "literal religious fervor and dismissing all criticism out of hand", or were you referring to my grandparent? I don't think I dismissed anything out of hand, I specifically acknowledged both the value of ephemerality and the point that traditional libraries are curated.
I agree that there is a danger that people may not realize how public and permanent the things they published to Tumblr were, or how dangerous it can be to do so (and I downvoted a sibling comment dismissing this danger). However, I think you and I have different threat models.
In my mind, archiving PII that is intentionally published is not particularly harmful because most lay people do, in fact, understand that their avatar, username, and by default, posts are public on Tumblr. They have had the opportunity to remove that information this whole time, and they still do: Archive.org removes stuff if you ask them.
By contrast, lay people have no mental model for what kind of information is incidentally collected nor how dangerous or benign it is. Certainly, lay people also can and do misjudge how public and how dangerous the things they intentionally publish are, but the gap is far, far less than incidental information. "Would you tell a stranger this" or "would you write this on a bathroom wall" are decent heuristics: the only difference in danger between text written on a bathroom wall and written on Tumblr is due solely to the potentially wider reach and possibility of even going viral on Tumblr. (Photos, of course, can also subtly compromise privacy in ways surprising to a lay person, but the gap is still much smaller than incidental information.)
In my threat model, that gap in understanding is much, much more dangerous than the intrinsic danger of PII. That's why I think that, as long as Archive.org has a usable removal process, pretty much all the danger is in surveillance capitalism's collection of incidental information, not Archive.org's permanent record of intentionally publicized information.
The reason we fight against censorship (which is what this debate comes down to) with literal religious fervor is because that's how the other side fights for it.
Don't want it archived forever? Don't put it on the Internet. Seems simple enough.
If Archive.org had your attitude, I would actively oppose it. Removing private, personal info is not censorship. And nothing about "just don't put it on the Internet" is simple. What if someone hacked your devices and then put it on the Internet for lolz? What if you shared it in confidence with someone you trusted, who is intentionally putting it on the Internet to hurt you? What if you accidentally pasted the wrong thing or uploaded the wrong file? What if you were a child and didn't understand the dangers?
There obviously should be ways to ameliorate your mistake, which is why it is absolutely critical that Archive.org has a removal process.
Many people writing personal diaries/letters probably didn't do so with the intention of it being saved forever in a publicly available, easily searchable archive.
Yet such data is invaluable to historians and can give us a window in time through the eyes of people who lived that time. Having that publicly available data lost for all time would be an immense loss to future generations.
I'm sure in a few generations, some historians will study those archived porn blogs and get an insight on the evolution of humans' relations to sexuality that today's historians can only dream of.
Ironically, IP addresses are probably the _least_ personally identifiable bit of information in a lot of that stuff. Most people's IPs are assigned to someone else within months, or even hours. But a username, profile picture, etc? Those are potentially identifiable.
In a reply to your sibling I explain how in my view, the fact that lay people have no mental model of what kind of information can be incidentally collected and how dangerous it is, whereas lay people are much more capable of understanding the dangers of a personally identifiable username, profile pic, and personal details revealed in posts, makes the former far more dangerous than the latter.
> The John Carmacks of today will be remembered in detail with or without blanket archive efforts.
Sure, but this leaves us a distorted view of history, where we have lots of details on the lives of "great men" and next to none on how ordinary people lived. Which means the vast majority of people who lived and died in that period end up written out of their own history.
Archaeologists spend a lot of time rooting around in ancient rubbish piles and cesspools, because these are some of the very few places where physical evidence of how ordinary people lived has survived. Nobody in ancient times would have nominated those sites as culturally important or worthy of preservation. But what we know of how ordinary people in those times worked, played, ate and drank comes largely from things dug up from them.
I certainly am sympathetic to the preservationist mindset. OTOH, even if we restrict ourselves to content that is natively created in digital form, the amount of "stuff" that comes into existence every day--much of it not on the public web or public social media--is staggering. (And much is not public for good reasons.)
I'm not convinced that we should feel a compulsion to save all of that. Just because it's more practical to be a pack rat about digital content doesn't mean that, taken to extremes, it doesn't still seem like being a pack rat.
In 2008 I found a parcel of bare EPROMs at a flea market containing 27 games. One of those games was Cabbage Patch Kids Adventures in the Park, and it was spread across 12 chips, each one showing a progressive state of development across 9 months.
To my mind, this was the only known find of a vintage Atari 2600 game and its iterative development process. So, 30 years later, the only reason we had this snapshot is because someone found these chips and sold them at the flea.
The current state of digital preservation is abhorrent. Those roms would have taken up less than 1/4 of a 5.25" floppy, but the company behind them never thought to preserve that information or data.
Take2 Interactive republished BioShock in 2012. They couldn't find their source code. They didn't save it. They had to go machine to machine looking for it. The reissued game is not the same as the original.
As a society, we don't place any value on this stuff, but the potential value of it cannot be understood until the future has occurred. Letting it vanish is a disservice to the future. In the past, if a book was published, it wasn't going to vanish if the publisher went out of business, there would simply be no new copies.
In our digital online age, things vanish in seconds, days and hours. This is also a very different state of affairs. In the past we could not save everything, but everything didn't have a clock counting down from the end of the quarter over its head, counting the seconds until it is deleted.
The Library of Congress tries to save everything. Yes, libraries weed the stacks and choose items to host. This is due to space concerns: they can't host everything ever. Digitally, they can, and many host reams of microfilm and old newspapers because they can.
Libraries can, thanks to tech, now host every book ever, digitally, for very low costs. Copyright prevents that.
This is an unabated good. Leaving things behind and forgetting them is how you get Tulsa, Oklahoma, or the Armenian Genocide denials. We don't get to choose what the future finds interesting, and for the first time in history, we do not have to. Why in the ever-loving fuck would you worry about that?
Most likely, only for personal reasons. This is a humanity level problem. Your personal worries are irrelevant in 100 years when everyone who ever knew you is dead anyway. Geocities would be more interesting at that time, as a subject of study.
Library of Congress, British Library, Bibliothèque Nationale etc choose to save everything they are mandated to, and a fair bit extra besides. That includes everything published. They don't save their water cooler chats, personal letters and everything sent by post, everything said on the phone or Facebook, etc.
The bar - perhaps found accidentally - seems quite important in deciding what must be archived, and what probably shouldn't.
Archives of personal letters and ephemera, preserved in manuscript/special collections libraries, are incredibly important research sources. This often includes letters which were never meant to be published. LOC had a project to preserve every tweet (published to the world) until a few years ago - who knows what tweets might be useful to future researchers?
And yet, hundreds of years later, historians and linguists crave letters, and post, and telegrams to get a glimpse of actual life outside official publications.
Sure, and a hundred or more years later the family of the author, or relatives of the recipient can decide to release the family letters or telegram from WW1 or the US Civil War etc. That delay, usually at least until the correspondents have died, is important. The affair, the less than ideal belief, and all that other imperfect demonstration of humanity can no longer hurt or embarrass. It ceases to be private and personal and moves into the historic.
Releasing them whilst the (probably famous) sender is still alive is most often done to cause damage, is simply tasteless, or amounts to paid-for revelations in the gutter press.
> Leaving things behind and forgetting them is how you get Tulsa, Oklahoma, or the Armenian Genocide denials. We don't get to choose what the future finds interesting, and for the first time in history, we do not have to.
There is plenty of evidence for the Armenian genocide, the Holocaust, and 9/11. That doesn’t really stop deniers or conspiracy theorists. When it becomes politically advantageous, spreading misinformation becomes weaponized and mainstream. A bunch of nerds saving some ROM dumps isn’t going to really change that.
Given what happened to the Library of Alexandria, it’s also quite idealistic to think archive.org will be around in 100 years or more. Not that we shouldn’t do it... but the future can be unkind even to modern technology.
> We live in the only point in human history where we can actually save all of humanity's knowledge and culture,
Because of the welter of proprietary, undocumented formats, media bitrot and the like, we are actually moving away from such a point.
Turns out historians may not be so upset. You can be a historian of early medieval France and have a chance of reading 100% of the surviving documentation. Too much data can obscure the story.
Of course historically you could get a PhD for compiling a concordance to Shakespeare, something that can now be done mechanically in seconds. Future historians could (and will) apply the same tools to today's surviving documentation. But I don't believe there'll be as much of it as you seem to think.
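To make the "mechanically in seconds" point concrete, here is a toy concordance builder. It is only a sketch: `build_concordance` is a made-up name, the tokenizer is a crude regex, and a real scholarly concordance would also handle spelling variants, lemmas and proper citation of act, scene and line.

```python
import re
from collections import defaultdict


def build_concordance(text: str) -> dict:
    """Map each word to the (line number, line) pairs where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        # set() so a word repeated within a line is listed once for that line.
        for word in set(re.findall(r"[a-z']+", line.lower())):
            index[word].append((line_no, line.strip()))
    return dict(index)


sample = "To be, or not to be\nThat is the question"
concordance = build_concordance(sample)
print(concordance["be"])        # [(1, 'To be, or not to be')]
print(concordance["question"])  # [(2, 'That is the question')]
```

Run over a plain-text edition of the complete works, the same loop finishes in well under a minute on any modern machine, which is the whole point about what used to earn a PhD.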
> Because of the welter of proprietary, undocumented formats, media bitrot and the like, we are actually moving away from such a point.
The best we can probably say is that it's different. We're capable of saving far more but, in practice, a lot of digital media is locked up in walled gardens and accounts that have to be paid for and require logins.
It's presumably easier to save a bunch of photographs or videos in a way that they'll be accessible so long as key Internet sites or their successors are. A fire or flood probably won't destroy them. OTOH, unless you've taken affirmative steps to upload that media to the right place, it won't be serendipitously discovered in a shoebox some day in the future.
> Turns out historians may not be so upset. You can be a historian of early medieval France and have a chance of reading 100% of the surviving documentation. Too much data can obscure the story.
I don't understand this reasoning. Yes, more data = more work, but less data = more likely you're wrong.
Is your entire original comment meant to be read as sarcasm then?
Or are you advocating for the idea that history is arbitrary and it’s better to just have a simple story than to have to worry about what really happened?
I’m having a hard time understanding what idea you’re trying to position in this debate.
I was making three points in the three paragraphs:
1 - it's more likely we will be an information-sparse region in the historical record rather than an information-dense region.
2 - professional historians have their own set of incentives which can be counterintuitive to the layperson.
3 - but indeed if there turns out to be a huge amount of stuff (there will likely be mountains of some forms of ephemera) to go through some people may be able to find value using new tools not available in the past to historians.
As someone trained as (but never worked as) a historian I do indeed have a bit of cynicism on point 2. I suspect most if not all actually working in that domain have the same cynicism.
My understanding is that the Jesuits burned the Mesoamerican literature because they thought it was dangerous to their reign, not harmless. An appalling crime.
> You're right, let's burn the library down because one book has a libelous chapter in it.
I feel you got the comment backwards: a better analogy would be "if a used-books store full of Dan Browns were to burn down, would we regret the loss of maybe one chapter that has some value?"
Your position seems to be "yes", but I wouldn't dismiss so easily the opposite view: that 90% of everything is crap, and that keeping everything forever "just in case" sounds surprisingly similar to hoarding.
I do not oppose "purposeful archiving" - as someone mentioned, saving outgoing Wikipedia links seems smart. But my old Twitter account, where I kept track of missed trains? There are better sources for that, and no one missed it when it was gone.
It's almost impossible to evaluate what is of lasting value in the moment, while it is readily available.
Imagine an author writes a paperback. It isn't very good, but a few people read it. Later, one of those people goes on to rework some of those ideas into their own script for a film. The film is a success. Years later, the scriptwriter mentions the paperback as an inspiration while giving an interview, but it's long out of print.
To a biographer or a devoted fan of the film, this forgotten book, while of little value in and of itself, has become a valuable part of a larger story. If it were culled when the contents of that used book store burned down, we would have lost something without realizing it. And that's how we lose most things. The only way to minimize this is to store as much as we can, in the hopes that we may find a use someday, and thankfully digital storage has made this very, very cheap. The opportunity cost is tiny, and the potential reward, given enough time, is unbounded.
But the opportunity cost is not tiny. This is literally a Twitter thread asking for financial support.
And I do recognize that the thousands of petabytes will likely be chump change to store in a decade... but necessarily the economies of storage will keep pace with the rate of content production. It will always be expensive to store everything.
The question is, do we gain back this investment from future uses of these archives? I dunno. I’d be interested to hear what value archivists have gotten out of the archives, given it is decently old already.
Like with other "90% of X is wasted" sayings, you don't know which 90% it is.
Even if you look at classical art with an honest eye, you can find plenty of works that in themselves are, well, crap - but they're being preserved and reproduced and talked about, because they acquired meaning over time. They've become relevant in context.
Take your old Twitter account. It's probably not interesting. It probably won't ever be. But it might. Imagine several decades from now, your great-granddaughter becomes a well-known, influential politician. This might retroactively and posthumously make you relevant, and in the process your Twitter account. Biographers might find it useful. Or independently, people who're into historical train schedules. Etc.
It's near-impossible to predict what the future will find relevant, so if storing some memories is nearly free on the margin - as it is today, with digital technologies - then just storing it is a no-brainer.
I think a better analogy than a library would be your average day in the office: would you want everything you say and do in the office recorded for eternity? Sure it would help, say, catch fraudsters, track responsibility and credit, allow sociologists fascinating analysis - but is that worth it? The >1GB of Google+ is a good example. Probably many interesting posts from people that are the core experts on topic X - and many nonsensical Twitter-like posts of people sharing whatever they encountered or thought that day.
>> We live in the only point in human history where we can actually save all of humanity's knowledge and culture
Playing devil's advocate here for a moment...
Considering we as humans learn little from our past, keeping all of this knowledge is a benefit to whom, then? Some people who feel nostalgic about Sony's first Walkman? Or maybe people using it for nefarious reasons? If humans continue to make the same historical mistakes over and over, what benefit does the human race gain from cataloging all this information? I would venture to guess it's more plausible that it will be used against us instead of furthering our own culture.
>> We know more about how Rembrandt painted and lived than we do about how Atari 2600 programmers worked and lived
There is a huge difference between saving all of Rembrandt's stuff and saving that of some 22-year-old college-dropout programmer who created a video game in the heyday of a long-forgotten startup company. And yeah, there have been numerous documentaries and articles written about Atari in those early days. Who would want to save a dilapidated roller rink on the grounds that a great and noble video game company used it as their HQ for a few years??
But then this roller rink down the block became available: 10,000 square feet! I mean, we were just jam-packed, and we had people on roller skates actually running around on the roller-skate rink building Pongs.
While I do think leaving certain things to the sands of time is a good thing, vacuuming up everything is just as worrisome. Are we going to be hoarders of a bygone technological past where a large majority of the "stuff" we save will have little, if any use to anybody anymore??
Having a background in anthropology, I find it fascinating there will be many generations of kids who leave no physical trace of their existence since a large majority will be in electronic form. Just imagine how people's lives are in a sort of suspended animation after passing away and having their Facebook pages live on forever.
I would say there are several classifications of things worth saving through a broad net:
- kindling sources, like a LiveJournal post that inspired Lin Manuel to write Hamilton (for a fake example)
- early work of a future star, like imagine Lorde posted early songs to MySpace. This is already a clear issue as many posted songs have been deleted or lost for various reasons.
- valuable things on shaky ground. Yahoo Groups, for the latest example. But I just saw on Reddit someone was looking for a deleted scene from Blair Witch Project that was supposedly the first video ever published on Amazon Prime Video... and now it has nearly vanished. That seems crazy to me from so many directions.
- the value of the ephemeral. Gold and jewelry from old civilizations is nice but we know so much of how people actually lived by examining their garbage, scrap notes, broken bowls, etc.
- the myth of permanence. We feel like if 10 million people see a video, it is probably preserved. But there are no master tapes of any of this and so much of everything is interlinked and hard to piece together after the fact. What was people's tech stack when they were making MySpace? How big were people's hard drives? Did they share songs through Kazaa or play them on MySpace directly? What was the state of JavaScript then, what were the security issues or underground trends? How did songs propagate, where were they shared? Were people sending links in email or AIM, were people sharing links on Digg? This is stuff from like a decade or two ago and already you need to think like an archaeologist to have any sense of how the culture really existed because there were so many moving parts from year to year.
- the value of datasets. Imagine putting some thought against the Geocities archive to see how HTML blink tags grew then fell in popularity over time. Or how a meme propagated, or analyze the link structure between groups of people or by topic, or make any number of interesting inquiries about how humans operate culturally in digital space and how they interact socially through a certain set of tools and limitations. There are very interesting possibilities here for understanding ourselves better as a species.
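To make the dataset point a bit more concrete, here is a minimal sketch of the blink-tag inquiry. It assumes you already have archived pages as (year, html) pairs; the function name `blink_share_by_year` and the sample pages are made up for illustration and are not real Geocities data.

```python
from collections import Counter
from html.parser import HTMLParser


class BlinkCounter(HTMLParser):
    """Counts <blink> start tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.blinks = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BLINK> is counted too.
        if tag == "blink":
            self.blinks += 1


def blink_share_by_year(snapshots):
    """Return {year: fraction of pages containing at least one <blink>}.

    `snapshots` is an iterable of (year, html) pairs (a hypothetical input shape).
    """
    totals = Counter()
    with_blink = Counter()
    for year, html in snapshots:
        counter = BlinkCounter()
        counter.feed(html)
        counter.close()
        totals[year] += 1
        if counter.blinks:
            with_blink[year] += 1
    return {year: with_blink[year] / totals[year] for year in sorted(totals)}


# Made-up sample pages, only to show the call shape.
sample = [
    (1997, "<html><BLINK>Welcome!</BLINK></html>"),
    (1997, "<html><p>plain page</p></html>"),
    (2003, "<html><p>no blinking here</p></html>"),
]
shares = blink_share_by_year(sample)
```

The same loop would work for any other tag or pattern; the hard part in practice is getting a representative set of snapshots per year, which is exactly what a broad archive provides.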
>You're right, let's burn the library down because one book has a liable chapter in it.
It is more like, either you burn the library down or everything you have written in your private journal is now available to be checked out by anyone.
It really shouldn't be that way and I think we should fix the problem of holding people responsible for bad behavior in the past. But how do we draw lines (for example, what about holding people responsible for past crimes)?
>We need to save our culture and digital heritage, else we forget where we come from.
I agree, but we also need to ensure this is done without costing individuals. Technology has advanced, but society has not. Our technology outpacing our culture has hurt and will continue to hurt many people, and we should try to find a way to fix it.
It is appalling to me that the parent comment is being downvoted. Religious fervor indeed. The point being made is simply that saving every scrap of history including personal tracking and details that are normally LOST to history, is a sea-change in human history and shouldn't be looked at lightly.
There is value in forgetting. What we forget, then, is a very relevant question. "NEVER FORGET ANYTHING RAWR!" is not a useful point of view because it denies the very right to have a conversation on the subject.
If you can't agree that I should have some say in what is remembered (or at least archived) about me or generated by me, there's not much we can talk about.
And to make it even clearer, one just needs to think about leaked images. Should we not allow a person to delete such images leaked without their consent?
1,000 years from now archaeologists may have some academic interest, and those involved and even generations of their descendants will be long dead.
But what about 1 year from now? Benefiting those 1,000 years from now at the cost of those alive today is a hard position to justify, especially with such a blanket justification.
The Circle (both book and film) was, sadly, a largely botched attempt to explore issues like privacy, the power of tech companies, widespread surveillance, etc.
Missed opportunity in that the book was only readable if you took it as a deliberately over the top "if this goes on" fable. And the film was mostly notable for how it squandered a top-notch cast.
> Atari's old HQ is just another office building. The source code to those games is mostly gone (thankfully, it's assembly and easier to disassemble). We need to save our culture and digital heritage, else we forget where we come from.
Very good point!!!
In my view the Internet Archive should play the digital equivalent of the role of the National Register of Historic Places (NRHP). Shepherds of documentation, to give it a cool-ish sounding name.
My personal, obscure ISP user page (think the ~user/ era) from 1995 is preserved in all its drop shadow blink tag marquee glory at archive.org with me doing nothing, it was just captured by whatever natural processes. The things I said on mailing lists, random forum posts etc. - it's all archived. That 90s stuff isn't/wasn't as ephemeral as folks think in my opinion, it's out there somewhere. $0.02 :)
> I'm not sure it's always so great that everything anyone does online will be permanently archived.
But you see, even if the Internet Archive didn't exist, someone would probably still be saving a copy of the things you do. It'd just be a megacorp or surveillance agency instead of a more egalitarian organization.
So the choice isn't between "things on the internet are ephemeral" and "things on the internet are available forever to everyone"; it's between the latter and "things on the internet are available forever to some subset of the rich and powerful".
Maybe if everything were archived forever, we could understand that everybody makes mistakes, and stop paying so much attention to old posts? Though I admit this is a very optimistic view of human behavior.
Ancient posts rarely get any attention though, unless you are a politician, and even then most people agree it's worthless information. There was the recent event with the guy from Canada having photos of him wearing blackface almost 20 years ago, and most people agreed that something so long ago is totally irrelevant to today.
Some mistakes are also worth recording. I like seeing bad predictions of the 2000s from the 1970s for example.
Not to mention that quite a lot of what is archived today has been made by companies, and there's no "right to be forgotten" that companies could ever deserve. For example, I've uncovered quite a few mistakes in currently public datasets/websites based on archived sites; who knows how many mistakes are made now and never fixed because we lose the original sources. Point being that the lack of an original source doesn't mean the information gets lost, it just becomes a big version of the kids' game "telephone" where everyone recites what they heard and it gets distorted in the end.
>I'm not sure it's always so great that everything anyone does online will be permanently archived
The real problem here is the runaway cancel culture, where we attack people for things they said or did years or decades ago which were (at the time) perfectly acceptable and reasonable.
The most egregious example I have seen so far is cancel culture advocates who think we should disregard the late Richard Feynman’s legacy because he said some rude things to a lady back in 1946, even though the lady herself was not offended, since she did sleep with him later that same evening.
There’s a point where we just have to say “That was a long time ago, no one at the time was offended, get over it.”
Indeed. Context matters, and societies evolve over time. Opinions which we'd consider abhorrent today may, once upon a time, have been acceptable.
A comment made years ago is only a reflection of a person's opinions at that point in time; opinions which may have changed since.
The consolidation and permanence of the web are definitely concerning.
Moving from "somebody knows this happened" or "this is in a file drawer somewhere" to "there's a searchable record of this" expands everyone's access to the info, and can do a lot to stave off forgetfulness and bit rot. But the people who gain the most access are the ones who weren't involved in the first place, and the intersection of "uninvolved" and "cares enough to check" tends to be people who are actively hostile. Hence doxxing, stolen photos, and callouts over years-old tweets.
But that's a broad result of digitization. If a reporter or opposition researcher wants to embarrass someone, they can already look through digitized student newspaper essays, find interview subjects off class rolls, or simply comb through Twitter for long-forgotten offenses. (This holds for both good and ill - it applies to both serious skeletons and misleading or trivial issues.)
The Internet Archive, then, seems like sousveillance offsetting surveillance. For those who can point time, money, and connections at a target, it's enough that evidence exists, and more than enough that it's available online. But for the general public, it's much harder to keep track of countless sources or publicize news. If you can't dedicate interns and an archive to tracking every news story you read, you can't find or prove edits. (And while most newspapers noted corrections or morning/evening revisions, silently changing online stories has become common practice even for the likes of the BBC.) If you can't point out a webpage or tweet to thousands of people at once, the evidence is likely to be taken down before it's recognized. There are a lot of dedicated sites like NewsDiffs working on this problem, but Internet Archive provides a general-purpose answer to "let an average person see the history of a page or create a trusted record of it".
I worry that this just amounts to an eye for an eye, and still increases the total amount of scrutiny we're all under. But as long as more content is becoming permanent, it still seems better to have symmetrical access to it.
It's not actually true that everything anyone does online will be permanently archived. If it were, there would be no need for the Internet Archive.
The truth is, only the things someone has an interest in archiving will be archived, and only so long as someone has an interest in maintaining those archives. Just look at the recent announcement about Yahoo Groups... no one was, and likely no one is, going to permanently archive most of that. Sites, content and history get lost all the time.
I think it would be reasonable to establish a bar, similar to the offline world, where everything above it is archived and everything below it is opt-in.
In the offline world the National Libraries get a copy of every book, magazine and newspaper published, by law. At least that's the way the UK and US do it. They archive a lot of other stuff as well, including music, audio, adverts, but that's more informal, and there is no requirement to preserve.
Personally I'd like the things politicians and personalities (by dint of having chosen to live large) say online archived, and all business (to later hold them to account), along with the sites of anyone in the business of influence - think tanks, parties, lobbyists, activists, "grass roots" organisations etc. Individuals, anon forums, HN and reddit subs and other places of shooting the breeze should be allowed to stay ephemeral. In fact I think conversation is freer that way - some will choose to say less, say different, or say nothing if everything everyone says is forever...
In the US, mandatory deposit technically applies to any copyrighted work of any kind. We should fund LoC to enforce mandatory deposit on digitally published works as well.
In a sense this is also a good demarcation point. If something is serious enough to be worthy of copyright protection, it's probably worth archiving.
Funny you mention HN and Reddit as ephemeral because I always thought they were more permanent than most. While you can email the mods and ask that a particular post of yours be removed, I don’t think they will wholesale scrub your content out of the archives if you request, or help to anonymize them in any way.
I consider HN pretty much permanent and tread carefully with controversial opinions or things that might one day be considered not-PC.
What law in the US? In ancient times, like 40+ years ago, before the first big extension and before copyright was automatically granted at the moment of creation, it used to be required to send a copy to the LoC to earn the right to enforce copyright.
I've published several books the LoC does not have.
Also the national libraries aren't the sole archives of culture. Univ and private libs preserve all the important stuff government has not the interest or budget for.
To my surprise, I learned a few months back that mandatory deposit [1] is still actually a thing, albeit a completely unenforced thing. (I believe deposit is needed if you want to sue for damages, but in theory you're supposed to deposit in any case.)
Not only that, but I also wonder if we're overestimating the value of keeping all of this data around. Who's going to have the time to search and curate these mountains of information when we're generating tons more of it every day? I imagine the ideal goal is to allow future historians to learn about our past selves, but I think there's a tipping point where only those with lots of resources can afford to meaningfully consume it. Those typically are wealthy companies or individuals, and I'm generally less excited about what they do with our information.
Obviously there's value in archiving some information, but a save-all or even save-most approach starts sounding a little hoarder-ish. Sure you might one day make use of that 1997 November TV guide, but chances are you won't and in the meantime you're paying the opportunity cost of storing it.
Maybe we need to take a page from Marie Kondo and only keep that which sparks joy and learn to let go of the rest. There's a chance someone will need a bit of info that no longer exists, but we'll probably be ok.
Part of the challenge here is that it's hard to know in advance what is or isn't worth archiving. It may only be clear a few years later that some big chunk of now-dead data was important.
In that sense, curating all of it doesn't really matter as long as you archived it. Someone trying to find the data later (or curate it!) can find their way to the right URLs using other sources, and then begin the process of curating this archived data after-the-fact.
The Internet Archive is most useful when you click a link and it is dead, which happens very often. Wikipedia's references are filled with dead links which now point to IA.
There is probably a lot of junk data on IA though, especially video site archives, but it's worth keeping stuff that isn't needed if it means keeping stuff that was useful.
Well, there are some tools that have been developed that have pretty amazing capabilities to crunch through staggering quantities of data and come up with useful insights. It's basically big g's core competency, and there are tons of other companies that do the same thing, as well as open-source solutions that can be used.
In the not so distant future many people will record their entire lives: movements, utterances, biometrics, audiovisual and sensory data. Then they are going to freak out when dead people's lives start getting deleted because nobody is going to pay to host all this crap.
I was having the same thoughts. Most of what I've used Archive for is to look up e.g. old blog posts for personalities that show their hypocrisy compared to today, for example. Or someone posted something daft when they were 15 and their handle was leaked and now it's out there forever, and we can laugh at how stupid they were.
I'm sure glad I went to efforts to scrub my personal sites I made when I was a teenager!
I don't think all blogs and personal content (for lack of a better word) should just be archived. You should need consent. Most people have no idea it's going on. Or it should be very easy to delete something from the archive.
I'm convinced that this instinct to preserve everything forever is psychologically connected with the denial of mortality. (Edit: I'm not saying this is a bad thing, just suggesting the phenomena may be connected.)
I think it’s actually simply evolution at work. That’s part of our evolutionary process as humans, building on historical achievements of our ancestors.
It's good to note that there's a difference between archive and (big A) Archive, as a practice and discipline. As far as I can tell, Archivists (like, people who went to school to be an Archivist), don't really agree with Jason Scott's agenda and approach.
On one hand, sure, library science and forensic analysis are extremely important, and nothing lasts forever, especially without the care of curators. We aren't dismissing traditional nor classical archival methods, and they already have taught us much about how to do digital archiving. [0]
On the other hand, clearly the Internet Archive is a competent digital archiver, and they've earned the capital-A "Archive". They publish a larger digital commons than anybody else, I think, especially at the low low price of gratis.
It sounds like your entire complaint is in two points. First, that IA doesn't ask (much) permission, which is unsurprising. The history of libraries is not one of asking permission, but of simply doing it. The public has been convinced repeatedly, over the decades, that libraries are good for them, and this public support helps insulate librarians from corporate interests.
Second, that IA doesn't employ enough women. I can't help you with that, but you are free to improve yourself.
Why is going to school for a subject a proxy for competency? Jason works for the Internet Archive. Perhaps "archivists" might consider more real world experience versus academic exercises?
Archives aren't new. It's not like software engineering, where you can cowboy shit and blaze new trails without giving thought to what others in the past have done.
Not to mention that Archives is historically a female dominated industry. This is a real world example of a loud, boisterous man "disrupting" an industry.
It is entirely the wild west, where you can "cowboy shit and blaze new trails without giving thought to what others in the past have done" [1]. Anyone can be a digital archivist, anyone can run an archive (object store, metadata management, distribution). If I had to compare it to another industry, it'd be newspapers. Barrier to entry is low now (command line, compute, storage), and anyone can do it. This will continue as storage continues to decline in cost, tools get better (disclaimer: I maintain some tooling in this regard), and software improves for capturing physical materials as digital representations.
If someone thinks they can do better, they are free to try. No one is gatekeeping their attempt. Help yourself to some storage & VMs and write some code. If you do better (regardless of gender), everyone benefits.
[1] It's not bad to be able to wild west it and cowboy shit in non-regulated industries, where someone's life safety, finances, etc aren't at stake. Two cents.
Here are some snippets from Archivists that are actively talking about this:
>1. Archiving isn't just capturing data or downloading it, it's making it available into the future. Without an intense amount of planning around that, the act of capturing is pointless.
>2. We don't need all of Yahoo Groups. We need a subset of Yahoo Groups. Choosing what, exactly, is worth keeping is called appraisal in the archives world - not like monetary appraisal but cultural appraisal.
>3. We also don't need all of Yahoo Groups because hoarding data long-term is terrible for the environment. Digital preservation is also terrible for the environment. So we should be extremely judicious about what digital content we choose to attempt to retain permanently.
>4. It's an incredible violation of privacy and doesn't align with the ethics of the archives profession to collect all that data without permission from the people involved, especially in the case of private groups. People should also have the right to be forgotten.
>It’s also part of a long-term pattern of IA (and this dude in particular) deciding to “archive” things that people have not given consent for—and, in some cases, have explicitly asked not to be preserved.
There’s another gendered element here: IA tends to get tons of accolades and funding, and is largely seen as a group of do-gooder dudes just trying to preserve the internet. Meanwhile they routinely ignore or denigrate the work of librarians and archivists trained in digital preservation—professions that are overwhelmingly gendered as female.
>Plus Jason Scott is generally a dick to anyone who brings up any ethical qualms about their work (and he has a stupid hat in his avatar.)
I'm inclined to agree with this position. Does every YouTube, Reddit, Twitter, Hacker News, Facebook comment need to be archived and stored for the next thousand years?
I'd argue no, and there's a huge amount of waste in there - so many bot posts, or just spam.
But here we are, people want to archive every byte of information that traverses through the internet.
These days you have to tread cautiously, as everything you do or post may be archived.
There are huge privacy concerns in the present day, but there's the other side of it too: today we consider it a treasure when we find "Maximus sucks a big dong" type graffiti on some wall in Pompeii. The glimpse into the life of the Romans is itself exhilarating. A thousand years hence you won't be around to care about how your embarrassing posts affect your reputation, and those who find it might be more grateful for the glimpse into early 21st century life than inclined to snigger or cringe at your comment.
I don't really worry too much. Between all the stuff that doesn't get archived, the fact that the sheer volume tends to effectively hide any single piece of information, and the frequent difficulty of connecting online identities to specific real people, it's not like most people have a perfectly discoverable complete digital record online.
Which is fine as long as you remain anonymous and unimportant in the present day. It becomes much more of an issue if you abruptly become a person of interest, and all of a sudden there's a team of people motivated to go through that archive with a comb looking for the few juicy tidbits needed to publicly humiliate you.
And obviously this is already happening— in the Canadian election a few months ago, there was a big scandal where some pictures turned up of the prime minister wearing blackface at a holiday party he attended in 2001. And on top of that there was a pretty steady stream of candidates (some of whom were indeed booted from their parties) who were challenged over social media comments/posts made 5+ years earlier, especially on hot social topics where national sentiment has rapidly evolved in recent times.
Perhaps we will all just become inured to this, and rightfully be able to accept our leaders as human, judging them for who they are today and not past words and deeds. But is it possible to still retain the ability to fairly judge someone in the present while forgiving the past? Or will we lose the ability to discern the difference? The GOP's attitude toward president Trump does not give me hope on this.
True enough. I'm probably pretty happy that nothing I wrote or photographed got online anywhere without going through an editor until well after college. And in pre-digital/smartphone days a heck of a lot less was recorded for posterity than today. Not a lot of pictures from parties etc. even in my archives (and I did a lot of photography as an undergrad).
And, although I assume I'm in various Usenet archives, my participation there was always from a work address and was pretty tame. (BBSs mostly were too although all that content is gone anyway AFAIK.)
I know there are already firms that do online history scrubbing of people who get some warning that they're about to be entering the public eye, but that kind of thing is probably going to be a major growth area. Even if there's content out there that's beyond one's ability to clean up, just being forewarned/reminded of its existence could be an advantage.
I could definitely picture political parties and businesses being willing to pay $$$ for an internet presence dossier as part of their larger candidate vetting process.
Oct 29 12:28:53 <Cambion> we have logs that date back to 1993 and identify pretty much everyone from nickname change to nickname change
Oct 29 12:28:59 <Cambion> phone numbers, everything
So, copyright conditions are apparently another silent killer [1].
A website can be archived, vanish from the web, and then vanish from the archive for technical copyright reasons (a new owner's robots.txt file at the root). So "archiving the archive" might be useful. Or something.
Ezboard was an old discussion site that contained much of interest - archived and now the archive is not accessible.
TIL they still follow robots.txt -- https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea... mentioned that they were planning to stop doing that (and I remember reading a news article based on it claiming that they already stopped following robots.txt, hence my confusion). Truly a shame. I get following robots.txt at collection time, I don't get following a robots.txt that was added later.
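A rough sketch of the difference between the two policies, using Python's standard `urllib.robotparser`. This only models the behavior described above, it is not the Internet Archive's actual code; the `ia_archiver` user-agent string, the function names, and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Check one robots.txt body against one URL for the given crawler name."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def visible_capture_time_only(robots_at_capture: str, robots_now: str, url: str) -> bool:
    """Policy A: only the robots.txt in force when the page was crawled matters."""
    return allowed_by(robots_at_capture, url)


def visible_retroactive(robots_at_capture: str, robots_now: str, url: str) -> bool:
    """Policy B: a robots.txt added later also hides existing captures."""
    return allowed_by(robots_at_capture, url) and allowed_by(robots_now, url)


# Hypothetical scenario: the original site had no robots.txt; a later domain
# owner adds a blanket block for the archive crawler.
robots_1999 = ""
robots_today = "User-agent: ia_archiver\nDisallow: /\n"
thread_url = "http://example.com/forum/thread1"

still_visible_a = visible_capture_time_only(robots_1999, robots_today, thread_url)
still_visible_b = visible_retroactive(robots_1999, robots_today, thread_url)
```

Under policy A the old forum thread stays readable; under policy B it disappears from view even though the capture itself was made when nothing forbade it.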
A while back I tried looking up a particular ezboard forum I used to participate in, only to find it was blocked. This is a real shame. I have some of the posts saved but it's only a fraction of the full forum.
> A shutdown announcement put these at risk. We worked with the founder, @tedr (RIP), who'd left the company, to save as much as we could. 3.5 Terabytes
I work for a company that owns a website containing information that in my opinion is valuable to the public. The website may go offline forever in the coming months. How can I get in touch with the Archive, to ensure that the content is saved? Parts of the content are not easy to index (e.g. there are "hidden" pages that you will only find if you have the exact URL), I can assist with that.
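For the pages that are only reachable by exact URL, one option is the save endpoint quoted earlier in this thread (web.archive.org/save/ followed by the URL). A minimal sketch, assuming you can export the list of hidden URLs yourself; the function names, the User-Agent string, and the example.com URLs are placeholders, and the `submit` helper is untested against rate limits or the service's current behavior.

```python
from urllib.request import Request, urlopen

SAVE_PREFIX = "https://web.archive.org/save/"


def save_url_for(page_url: str) -> str:
    """Build the 'save this page' URL for one page."""
    return SAVE_PREFIX + page_url


def submit(page_url: str, timeout: float = 60.0) -> int:
    """Ask the Wayback Machine to capture one page; returns the HTTP status.

    Performs a real network request, so it is not called in this sketch.
    """
    request = Request(
        save_url_for(page_url),
        headers={"User-Agent": "site-owner-archiver (hypothetical)"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder list standing in for the "hidden" pages.
hidden_pages = [
    "https://example.com/hidden/page-1",
    "https://example.com/hidden/page-2",
]
save_requests = [save_url_for(url) for url in hidden_pages]
```

For a whole site this is no substitute for contacting the Archive directly, but it would at least get the hard-to-discover URLs captured.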
It seems they could save some money by moving a bunch of infrequently accessed data to warm storage. The entire archive does not need to be accessible 24/7.
I would be perfectly ok if I was trying to see a copy of a web page from five years ago, and it said that I had to make a request and it would be available in five or ten minutes.
I think I could wait five or ten minutes for a web page to get pulled from the archives.
Oh heck, I would be okay waiting for a few hours, or even a day for huge files. Though I suppose electricity isn't a huge driver for costs, afaict from that thread, the disks are.
A problem with most forms of warm storage is that they involve moving media around, which itself can lead to degradation.
I would like to see some write-once, really-lasts-forever form of storage come into widespread use. That seems ... like it's still a few years off.
And you still have the issue of how such storage is read, and how that affects total lifetime.
As a compromise you could have it be more about availability and rate-limiting - if you've got a given piece of data split across 4 currently-live mirrored storage disks and you've got 32 front-end machines able to serve it up to users, you're using operational capacity for that which could be used for other higher-demand data. Assuming you have a robust backup/recovery strategy, low-demand data could be stored across 2 or 3 live disks (note: you'd still want backups somewhere for emergency recovery) instead, and have a smaller number of front-end machines dedicated to providing data to users.
In that case if the low-demand data front-end machines get too much load, you'd be entering a queue to be served.
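Roughly what that queueing could look like, as a toy sketch: a small pool of serving slots for low-demand data, with extra requests waiting in FIFO order. The class name `LowDemandFrontend` and the slot count are invented for illustration; a real system would also need timeouts, fairness, and so on.

```python
from collections import deque


class LowDemandFrontend:
    """Toy model of a deliberately small front-end pool for low-demand data."""

    def __init__(self, slots: int):
        self.slots = slots        # how many requests can be served at once
        self.active = set()       # requests currently being served
        self.waiting = deque()    # requests queued in arrival order

    def request(self, request_id: str) -> str:
        """Serve immediately if a slot is free, otherwise join the queue."""
        if len(self.active) < self.slots:
            self.active.add(request_id)
            return "serving"
        self.waiting.append(request_id)
        return f"queued at position {len(self.waiting)}"

    def finish(self, request_id: str):
        """Free a slot and promote the longest-waiting request, if any."""
        self.active.remove(request_id)
        if self.waiting:
            promoted = self.waiting.popleft()
            self.active.add(promoted)
            return promoted
        return None


frontend = LowDemandFrontend(slots=2)
first = frontend.request("page-from-2004")
second = frontend.request("old-forum-dump")
third = frontend.request("geocities-page")    # no free slot, so this one waits
promoted = frontend.finish("page-from-2004")  # frees a slot for the waiter
```

The user-visible effect is the "please wait, your page is being retrieved" experience described above, just driven by front-end capacity rather than by moving media around.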
Not sure that's realistic here though, it sounds like the main problem is just the volume of data so whether the storage for the data is hot, warm or cold isn't necessarily going to make a big difference and having enough frontends to serve the content isn't a big problem (in comparison) either
The Food and Drug Administration (FDA) has been archiving the FDA.gov site at archive-it.org .
Leaving a lot of dead links on the FDA site.
Sometimes they tell you to look in the archives for the old information, without giving you a link to it, and sometimes they don't, they just expect you to know.
Now why can't the FDA afford the space to keep their pages forever on their own site? Fill in your favorite conspiracy theory...
Some of the information that has been removed is important health research; the 2015 hearings on fluoroquinolone antibiotics are just one example.
> Now why can't the FDA afford the space to keep their pages forever on their own site? Fill in your favorite conspiracy theory...
This seems like a prime example of Hanlon's razor. Tight government budgets and lowest-bidder contractors not bothering with page permanence strike me as the most likely explanation.
It may also be worth observing that it's how pretty much any private company website operates. Arguably the FDA should be different and archive old, even outdated, content. But, while private companies may explicitly archive some materials like press releases and earnings reports, 99.9% of their focus is on the current content and they'll mostly just delete anything that's not in service of today.
It's not limited to the FDA, in the last 3-4 years many US agencies have purged data from their websites, and in some cases the employees themselves have protested this in public and begged people to make archives. It's great to have multiple archive sources for that stuff because you never know when it's going to disappear if politics are involved.
IA are a capture process. Wikipedia supports active updates to content, under contentious circumstances, as well as structuring content for useful access (something IA also does, but under looser conditions).
Both are highly worthwhile projects -- some of the best of the Web. But a straight apples-to-pomelos comparison is difficult.
Wikipedia also gets a lot of that done with free labor. And, over the years, their spend has increased quite a lot more than their content, update frequency, and so on.
The real "silent killer" I see here is the reliance on mirroring. One-failure protection, with a 2x expansion factor. As it happens, I work on large storage systems, where 2x is our maximum expansion factor and for that we get resistance to as many as nine simultaneous failures. Across power and network failure domains, with multiple kinds of background scrubbing to detect loss of that redundancy. Oh, and 60PB is something we might add to an existing cluster for a day to absorb transient I/O load. There's also a bunch of monitoring and automation stuff that should be considered "table stakes" for storage at these scales. Seems like an opportunity to use what I and others have learned for a good cause, to make this valuable resource more efficient and more durable all at once.
Did you get to the part where they explain that they are strapped for resources and in need of donations? They have plenty of expertise, money is what they need.
If they're only getting one-failure protection for a 2x expansion factor, then they clearly do need more expertise and more money won't solve their durability problem. With more money they could expand to 100x their size and have the exact same vulnerability to data loss. That's a problem worth solving.
Is your argument that they could have had 10x less vulnerability to data loss while operating at the same cost and backing up the same amount of total data? (If so, then this really is a valuable skill)
Or are you saying that they should have backed up 10x less data in order to have the 10x more redundant copies? That's a value decision on what to prioritize (more data saved vs 'good enough' resiliency).
Donations would be the best way to solve this problem so that they don't have to make this trade off in the first place.
The decision here was: For the given budget we have, where do we make the reliability vs data backed-up trade off? Increasing one necessarily decreases the other.
The options are:
- Increase reliability but decrease how much data capacity they have
- Increase data capacity at the cost of lower reliability
- Remove the "given the budget we have" constraint via fundraising. With more funding they can afford to increase both
You can see the decision they made for the fixed budget option. Now we can help them remove their constraint with extra donations (Every $5 helps. I've already donated)
Simply not true. Encoding schemes represent a whole different set of tradeoff possibilities which you don't seem to have considered, and simple replication is worse than other options in terms of both reliability and efficiency.
It's different techniques (erasure coding). As it turns out, applying those techniques would reduce need for additional physical resources as well, since it allows data to be stored more efficiently with the same physical resources. It does require more CPU and memory relative to bytes stored or spindles to store them on, but it's easy to make that tradeoff and still come out ahead.
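To put rough numbers on that, here is a minimal sketch comparing plain mirroring with an erasure code at the same 2x expansion factor. The 9+9 layout (nine data fragments plus nine parity fragments, as an MDS code such as Reed-Solomon would allow), the 1% per-device failure probability, and the assumption that devices fail independently are all illustrative guesses, not figures from the Internet Archive or from any specific storage system.

```python
from math import comb


def expansion_factor(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_fragments + parity_fragments) / data_fragments


def loss_probability(data_fragments: int, parity_fragments: int, p_fail: float) -> float:
    """Probability that more fragments fail than the scheme can tolerate.

    Assumes each fragment sits on its own device and that devices fail
    independently with probability p_fail (a simplifying assumption).
    """
    total = data_fragments + parity_fragments
    return sum(
        comb(total, failed) * p_fail**failed * (1 - p_fail) ** (total - failed)
        for failed in range(parity_fragments + 1, total + 1)
    )


P_FAIL = 0.01  # assumed per-device failure probability within one repair window

schemes = {
    "mirror 1+1": (1, 1),   # mirroring: one data copy plus one replica
    "erasure 9+9": (9, 9),  # hypothetical layout with the same 2x expansion
}

for name, (k, m) in schemes.items():
    print(
        f"{name}: expansion {expansion_factor(k, m):.1f}x, "
        f"tolerates {m} failures, "
        f"loss probability {loss_probability(k, m, P_FAIL):.3e}"
    )
```

Both layouts cost the same raw capacity, but under these assumptions the erasure-coded one only loses data when ten or more of eighteen devices fail at once, which is many orders of magnitude less likely than losing both halves of a mirror. The price is the extra CPU and memory for encoding and reconstruction mentioned above.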
No doubt Facebook considers theirs a competitive advantage, but I'm kind of surprised there isn't a reasonably robust OSS erasure-coding based object store available in 2019. Maybe LizardFS?
Ceph has an object-store interface (which might still account for the majority of its usage) and some people I used to work with on Gluster are now at Minio, so those are the two I'd look at if I needed something in that space.
I donated a week or so ago. The Internet Archive has come in handy many times for me, not just for the Wayback Machine but also things like their live music archives. They're an indispensable resource.
ROMs are legal, at least here in the USA. That doesn't stop big companies from bullying little sites and becoming copyright trolls. Thankfully, the Internet Archive will not be easily bullied into taking down legally archived items.
The trick is you have to create the ROM yourself apparently. Sharing it is bad. I would assume copies made by anybody would be identical, so it would be interesting to see such a case in court against The Internet Archive, not that I want them to get sued, but would love for them to win if they did.
Nintendo is more aggressive about ROMs than anybody.
> Nintendo is more aggressive about ROMs than anybody
They are, but they’re also the most proactive in making their old games available on new hardware. It wouldn’t be fair to call Super Mario Bros “abandonware” because Nintendo has done a lot to keep the game alive and accessible to new generations.
In many cases yes, but historically the Internet Archive has enjoyed legal exemptions to protect them, see https://archive.org/about/dmca.php for example
"Computer programs and video games distributed in formats that have become obsolete and which require the original media or hardware as a condition of access." can be argued to cover many of the video games they currently archive. You could probably argue this applies to classic nintendo ROM dumps and such as well but I doubt anyone wants to have a 10 year long, multi-million USD court battle with a notoriously abusive legal department like Nintendo's.
When I took an animal communication class in college, there was a formula for the value of information (not data).
I don't remember the exact formula, but it was similar to Shannon's equation. Basically, the more valuable information is, the larger the change it effects in the probability of what an organism's next behavioral state is. So, if information signaling didn't change the behavior of another organism, it wasn't considered communication.
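Since the exact formula isn't given, the following is only one plausible way to express the idea, not the formula from the class: measure how far a signal shifts the probability distribution over the receiver's next behavior, using the Kullback-Leibler divergence (a Shannon-style quantity, in bits). The behavior names and probabilities are invented for illustration.

```python
from math import log2


def information_gain_bits(prior: dict[str, float], posterior: dict[str, float]) -> float:
    """KL divergence D(posterior || prior) in bits over behavioral states.

    Assumes every state with nonzero posterior probability also has
    nonzero prior probability.
    """
    return sum(
        q * log2(q / prior[state])
        for state, q in posterior.items()
        if q > 0
    )


# Hypothetical probabilities of a receiver's next behavior.
before_signal = {"feed": 0.6, "flee": 0.1, "rest": 0.3}
after_alarm_call = {"feed": 0.1, "flee": 0.8, "rest": 0.1}

# A signal that changes behavior carries information under this measure...
print(information_gain_bits(before_signal, after_alarm_call))
# ...while one that leaves the distribution unchanged carries none.
print(information_gain_bits(before_signal, before_signal))
```

Under this sketch, a signal that leaves the behavior distribution untouched scores zero, which matches the point that it would not count as communication.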
It is not the pineapple fund, although it has been a lovely time working with them. Regardless of the cryptocurrency discussion, the fact that someone who feels some sort of windfall would immediately have an urge to share it with organizations that need it is laudable.
IA and WBM are great and essential, like a Library of Congress/Smithsonian. What's frustrating about some old websites, like Microsoft's or Borland's FTP download areas, is that dynamic links weren't followed and can't be followed, and some websites used user-agent filtering. CDN links also weren't captured well.
There are so many retro patches that just don't exist publicly. For example, a number of files on SciTech's IA's WBM have zero captures. Most FTP sites weren't captured in WBM adequately either. There are spots of FTP archives hosted here and there on IA and elsewhere, but they're not like WBM for static content sites, and a single snapshot archive lacks the history and the changes, before and after. It is what it is, unless folks donate their vintage personal/work local mirrors to add to the collective.
The Internet Archive is great, but it risks becoming a single failure point if we rely on it too much. Also it would be good to take some of the server load off of them. One possible solution is for smaller archives to exist. So if you're interested in archival, and you have some spare time and cash, consider not only donating to IA, but also setting up your own archive site with content on whatever category or topic that you found interesting enough to archive.
The British Library's web archive[1] does similar archiving work, but limits itself to 'British' sites - in other words, sites with a .uk domain. I've had good dealings with the admins before, when I submitted some of my personal sites for inclusion in the archive.
Interestingly, the British Library uses web crawlers based on the Internet Archive's Heritrix web crawler[2], which demonstrates how important IA's work is for many other archival organisations' work.
A linear amount of data could be saved if we extricated text content from the HTML skeleton that contains it.
I wish Semantic Web had taken off. "Pages with styling" was suboptimal. Web apps are such a weird evolutionary branch we've descended into, and they don't relate to documents.
Content instead should have fallen under a type of ontology: news item, blog post, technical reference, comment, status update, ... If we'd adopted such a markup grammar and styled around it, we could parse out meaning, have stronger links in the graph, and compress.
Semantic Web would have happened if commercial web didn't outpace it.
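As a rough illustration of separating text content from the HTML skeleton, here is a minimal sketch using Python's standard-library parser. The sample page is made up, and this throws away all markup, links, and structure, so it shows the size difference rather than a format anyone should archive in.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up page used only to illustrate the size difference.
page = """<html><head><title>Post</title><style>p { color: red; }</style></head>
<body><div class="wrapper"><p>Hello, <b>archive</b>!</p></div></body></html>"""

text = extract_text(page)
print(text)
print(f"{len(page)} bytes of HTML -> {len(text)} bytes of text")
```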
Once you start transforming a document you always risk discarding information that later will turn out to be useful. Besides, I bet images, videos and other non-text data dwarf the space required by HTML markup.
TIA should get money from the UN. Or be the only beneficiary of a flat tax on network ports - I bet even just $5 on every small router sold in the US (which is basically nothing) would generate a ton of money for them.
It's too bad that there's not a vaguely-somehow-related-but-not-really and impossible-to-censor service that retains stuff that sites have excluded using robots.txt or whatever.
The good news is that the Wayback Machine stopped making robots.txt retroactive. That handles most of the cases I was running into where content I wanted was removed from the archive.
I was thinking more about the 90s era dream of uncensorable "data havens". That led to Freenet, for example. Which is slow, and forgets stuff that doesn't get accessed. And Tor onion services, which are more readily taken down.
But the problem is that there's no way to know in advance whether something is about to disappear from the Internet Archive. So you'd need someone inside who'd discreetly alert the backup service.
The Internet Archive is a fantastic and important initiative and we should definitely support it.
But:
Let's also support the public service players in its space that often get forgotten or marginalised by its well-funded marketing.
I'm thinking particularly of the various national libraries that preserve content under incredibly tight budgetary, PR and legal constraints that the IA is relatively free of.
On IA-aware media like HN, there is a tendency to present it as the only preservation initiative out there, which is absolutely not the case.
Some things about the Internet Archive that aren't obvious unless you've been there in person:
- It's housed in a church, so it doesn't look like a typical software development shop; it's a mix of open office and workshop floor.
- A big workforce actually does a lot of manual work scanning books and transferring content from different media types.
- It's an interesting experience to tour the data center, which is in the office itself. I can't remember the details, but there was something special about those machines related to controlling the heating.
I don't understand why archive.org keeps all kinds of formats, .ogg and .mp3, .pdf and .jpg for the same resource? Why not just whatever the original format was? Sometimes it's a dozen or so formats, it seems like.
There are probably a lot of reasons, but I don't know where to chat about it to learn more.
edit: now that I think about it, if the original is pdf, then jpeg makes sense for loading one-page-at-a-time in the browser, but it seems like mp3 transcoding to ogg is reasonable to leave up to the user?
The derivations are auto generated for easier access by various clients. In the metadata, they are marked as original versus not original for later discernment and perhaps rerendering.
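To show what that original-versus-derived marking makes possible, here is a small sketch that groups derived files under the original they came from. The file listing is invented, and the field names ("source", "original", "format") are an assumption modeled on the shape of an item's file metadata, so check a real metadata record before relying on them.

```python
# Hypothetical file listing for one item; names and formats are invented.
files = [
    {"name": "concert.flac", "source": "original", "format": "Flac"},
    {"name": "concert.mp3", "source": "derivative", "format": "VBR MP3", "original": "concert.flac"},
    {"name": "concert.ogg", "source": "derivative", "format": "Ogg Vorbis", "original": "concert.flac"},
    {"name": "scan.pdf", "source": "original", "format": "Text PDF"},
]


def derivatives_by_original(file_list):
    """Map each original file name to the derived files generated from it."""
    grouped = {}
    for entry in file_list:
        if entry.get("source") == "original":
            # Make sure originals with no derivatives still appear.
            grouped.setdefault(entry["name"], [])
        elif entry.get("source") == "derivative" and "original" in entry:
            grouped.setdefault(entry["original"], []).append(entry["name"])
    return grouped


print(derivatives_by_original(files))
```

With a mapping like this, anything listed as derived could in principle be deleted and regenerated later from its original, which is the "rerendering" option mentioned above.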
So, then, the convenience outweighs the disk space cost?
And also, I guess, transcoding the least-popular format on-the-fly is still too CPU intensive for large files (zips of hundreds of jp2 images, etc)
Guess I'm looking for a magic bullet where there might not be one, I just want to see Archive.org keep doing what it's doing far into the future.
Who knows, maybe we will stumble on some magic bullet, new compression algos (Zstandard? AVIF? AV1?), user clouds for compression (like boinc; archive already lets users assist via bittorrent for bandwidth costs). Thinking out loud.
Anyway, keep doin' what you're doing. And podcasting. That's good too :)
I found out about the Wayback Machine back in 2007 or so. I was young and just starting to explore the internet and found it incredibly fascinating to explore news sites as they were on the day of major historical events (9/11, etc). I've relied on the Internet Archive by way of the Wayback Machine countless times. I'll be adding them to my modest donation list, along with Wikipedia and similar public service sites.
The Internet Archive is amazing in many ways. Kind of sadly, for me, the most amazing thing is that they haven't been sued out of existence. How do they manage to operate with copyright laws being what they are?
Is there any way to download WARCs collected by the Internet Archive? I'd like to try and back some up and use them for some search algorithm benchmarks.
This is one of the last bastions of the internet I remember as a kid. People publishing knowledge because they could. Sharing things that were interesting. Giving to other people. Sure there has always been seedier elements or commerce, but by and large this was a place to grow and share knowledge. By most every measure today the internet is a better place than it was but I still miss those days...
I’m somewhat curious about their backup methods - mirrored drives aren’t great even if they’re stored at two separate locations.
Surely this would get enough support that they could host torrents of the content stored in chunks, and have many peers download and seed many chunks making the backups entirely distributed? I’d gladly seed a few hundred gigs of data to ensure they maintain good backup procedures.
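A minimal sketch of the chunk idea, assuming fixed-size chunks and SHA-256 digests (both are my choices for illustration, not anything the Internet Archive is known to use): publish a manifest of chunk hashes, and any volunteer holding a chunk can be checked against it before their copy is trusted.

```python
import hashlib


def chunk_manifest(data: bytes, chunk_size: int) -> list[tuple[int, str]]:
    """Return (offset, SHA-256 hex digest) for each fixed-size chunk."""
    return [
        (offset, hashlib.sha256(data[offset:offset + chunk_size]).hexdigest())
        for offset in range(0, len(data), chunk_size)
    ]


def verify_chunk(chunk: bytes, expected_digest: str) -> bool:
    """Check a chunk received from a peer against the published manifest."""
    return hashlib.sha256(chunk).hexdigest() == expected_digest


# Stand-in for archived content; a real chunk size would be far larger.
content = b"archived bytes " * 1000
CHUNK_SIZE = 4096

manifest = chunk_manifest(content, CHUNK_SIZE)
offset, digest = manifest[1]

print(len(manifest), "chunks")
print(verify_chunk(content[offset:offset + CHUNK_SIZE], digest))  # intact copy
print(verify_chunk(b"corrupted", digest))                         # damaged copy
```

This only covers integrity checking; deciding which peers hold which chunks, and making sure rarely requested chunks stay seeded, is the harder part.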
Not sure why you are being downvoted. That was actually the first thing I thought - what is the backup strategy? Mirroring is for performance + service continuity.
Luckily that article from 2013 mentions:
> None of the Internet Archive's digitized data was lost in the fire as backups are held in multiple locations.
Perhaps they could get a grant from OpenAI or some similar AI research firm; someday, when the technology arrives, the Internet Archive will be the ultimate corpus.
While I tend to agree in this case, lots of libraries are behind a "paywall" in some form or other. There are explicitly membership-only private libraries, but many others, e.g. university libraries, also restrict access.
I wrote a Prime minicomputer emulator some years ago and put it online (telnet em.prirun.com 8001). The Prime was a minicomputer from the 80's. I worked with Primes for many years and also worked at the company as an OS specialist for 18 months.
Seven versions of the Primos operating system, from rev 18 to rev 24, have been recovered from disks, 9-track tapes, and 8mm backup tapes. The company died around 1992.
There has been a huge amount of Prime software that, as far as I know, is lost to time. Oracle ran on it. SPSS. A native DBMS. Every one of the major OS revs had 5-10 minor revs. Some software products are only available for certain revs. I actually used rev 12, so at least half of the OS versions are completely missing. And Prime released source for their products. Most of that is missing.
For me personally, the emulator was / is a very rewarding project, and maybe for a handful of others who are still alive and used Primes in high school, college, or work. It's been really fascinating to "relive" my Prime days now and then, and others have made similar comments.
But how valuable is it really? Is it more than just a curiosity now? Sure, for a current or future computer historian, it might be viewed as a gold mine, but as time goes on, it has less and less value IMO, especially as the people who actually used one die off. If in a few years, only 10 people in the world care about such a thing, does it make sense to save it for all time? I have my doubts. And if I did have all of the versions of Primos, and all of the software ever written for Primes, would it make a difference even now, when there are people alive who actually used this computer system? Also seems doubtful.
To me, an "archive everything and let someone else figure out if it's useful" strategy seems impractical. If I had my choice, sure, I'd like to have every major version of the OS and all of the products for each version. But all of the dot-revs too? Nope - not important. Would I like all of the manuals? Yep!
Some of the products like Oracle, DBMS, etc. were a bitch to configure back in the day, and certainly would be even harder now with very few people around to help. So even if I did have them, I doubt I could get them running. And even if they were running, they were a specialty product at the time, with few experts, so finding someone today that would be interested in them would be a needle / haystack thing.
It all reminds me of family pictures. Before my grandma died, she went through a large box of B/W photos, explaining to us who all the people were, and we wrote on the back of the photos. But just 3 generations after her, the generation after me, no one knows the people in the photos (except my mom is in some - their grandma). In a way it seems that this family history is somehow important, but no one younger than me is interested at all. And truthfully, I never look at the pictures myself. I ended up scanning the photos with my mom or grandma, putting them on a digital picture frame for my nephews, and leaving the rest out. I asked them first - they didn't care about any of the others.
I'm not an archivist, but it seems to me that curation is a much more important aspect than grabbing everything and keeping it alive forever. Not curating seems like a "scale to infinity" problem. Those tend to not end well.
I think the IA is a great project and don't intend to be critical; I'm just giving my own perspective on preserving one very small piece of computer history.
Twitter is the worst choice for publishing. This involved so much more text-scrolling and photo-expanding than is necessary.
Edit: If anyone would care to explain why they feel that chopping an article up into pieces of arbitrary size makes for a good user experience, please do so. I know Twitter gets a pass from HN in general but it is simply not suited to this type of content by design. This was a conscious choice by its creators.
There really needs to be a new rule added to the posting guidelines about not complaining about the format when Twitter threads make it to the front page. No, it’s probably not the best possible format, but every single time the comments are inundated.
I'm sure if you offer to donate someone will put it in a blog format for you. Hint hint.
Edit: Twitter is obviously the correct platform for this because of reach. Hacker News is reading it, his 33.3k twitter followers are reading it, reddit is probably reading it, everyone's reading it! So if the goal was to draw more donations, he's winning.
On Twitter you get built-in liking, sharing and commenting, people will also DM you with questions, it's all one tap away. If you just post a link to a blog post, almost nobody will click it and read it there.
No one is disputing that Twitter is popular. Likes and retweets will not save the Internet Archive any more than thoughts and prayers. People who are not willing to click a link will not be willing to make a donation.
It's not about Twitter being popular, it's about your content having organic reach. The liking and retweeting puts the story in front of a lot more eyeballs.
Not to be overly pedantic, but tweets can contain links, including links that go directly to the fundraising call. The compact visual design for tweets makes the URL and call to action very explicit and obvious. And people can choose to retweet/quote-tweet the tweet-with-a-fundraising-link for further emphasis.
For example, here's the thread author quote-tweeting the beginning of his thread, which allows him to prepend both a tl;dr and a link to "archive.org/donate"
If he wrote a blog post that includes the fundraising URL (e.g. at the top and the bottom) and tweeted a link to that blog post, you believe that that call to action would be more visible and obvious to the average reader?
I do Twitter threads once in a while and for the creator, it's a very nice yet informal interface (e.g. compared to blogging) for posting a stream of connected thoughts, especially ones regularly punctuated with multimedia, such as images and video. There's the obvious benefit of the immediate network effect – a tweet thread has potentially far more reach, faster, than any RSS feed. There's also the benefit that, unlike a typical blog post, it's very easy for people to engage directly with a single thought (roughly equivalent to a paragraph in blog prose) – either to retweet your most cogent point, or to reply to it. And it's easy as the author to engage specifically on that point.
On a deeper level, content follows form. The limits of Twitter inherently force me to cut my wordiness and elaboration to a concise point. Of course whether this makes the overall thread better than a blog post is up for debate, but it most certainly makes composition easier on the creator.
Once in a while I'll read the ThreaderApp's rollup of a thread, and almost always find it less of an optimal experience than just scrolling through the thread on mobile, though that may be attributable to differences in visual design. I can think of a number of Twitter threads that I simply cannot imagine working well in a blog or article format.
For example, here's a Twitter thread that was posted by a newspaper reporter, as she followed a mailman on his last day after 35 years:
AFAIK, she never posted this as an article on her newspaper's site – which to me, as a former newspaper reporter, is very surprising given publishers' demand to feed the content/pageview beast. I don't know if she ever explained this decision, or if the thread eventually did make its way into native article format. But I think the thread format worked beautifully here. By necessity, it minimized whatever tendency the reporter had to insert superfluous prose. And the Twitter visual design places heavy emphasis on the photos and videos, each of which exemplified the aphorism "a picture is worth a thousand words".
edit: the WaPo did do a recap of her thread [1]. Being a recap, of course, means it's inherently different than an original article about the mailman, but you still get a taste of the extra verbiage needed to connect scenes in traditional prose, which is not needed (maybe because people don't necessarily expect it) in a typical Twitter thread.
And again, the visual design aspect is critical imho. Twitter's format integrates images and video much more seamlessly than a standard website design, partly because tweets have a strict container max-width of ~500px, which is far narrower than most modern sites. And I suspect knowing that you're recording media for a Twitter thread fundamentally affects how you do it – e.g. you may feel less restrained about context and quality.
For example, one of the best tweets in the mailman thread imho is this one:
The text is very informal: It took forever to #MrFloyd to even get to his own party - people kept stopping him for photos and hugs!
But the attached video tells the story better and more immediately than any prose could. I can't imagine that 49-second raw footage clip being inserted into a standard story without being a massive distraction from the reading.
Personally, if I did any length of a Twitter thread, I'd want to also put it up as a blog if only for archive purposes. But I certainly understand why Twitter threads have become a popular way to write things.
I'm sure lots of people agree with your sentiment, it's just that this comment appears on every post of a Twitter thread so it starts to seem like noise.
It seems like all those things they backed up didn't seem to impact anyone's actual life. People lost files, websites, communications, whatever, and yet they kept right on living life. So I can't help but think that this is all a waste, like people who take pictures everywhere they go, and never look at them later.
Is there any useful purpose to all this other than nostalgia? And at what point does the nostalgia overwhelm the useful purpose?
There's a problem with the Wayback Machine in specific which can kill your ability to access it quite silently, unless you know how to use the browser's development tools and interpret headers.
It has to do with cookies: Somehow, the Wayback Machine sets cookies... and sets cookies... and keeps setting cookies, until it overflows its own ability to accept cookies. At that point, your browser tries to access a Wayback Machine page, handing the server all of the cookies it currently has, and the server refuses to deal. It absolutely denies everything, sending an error header and a blank page. You have to clear all web.archive.org cookies to get anything at all, at which point it works perfectly.
I've completely solved this problem by blacklisting web.archive.org in browser cookie blacklists. I haven't had it happen since then. As far as I'm concerned, the problem is diagnosed and just needs to be solved. At their end.
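For anyone trying to picture the failure mode, here is a rough simulation of it. The 8192-byte header limit, the one-new-cookie-per-page-load behavior, and the cookie sizes are all assumptions for illustration; I don't know the actual limit or cookie pattern on web.archive.org.

```python
HEADER_LIMIT_BYTES = 8192  # assumed server-side limit; the real value is unknown


def cookie_header(cookies: dict[str, str]) -> str:
    """Build the Cookie request header a browser would send."""
    return "Cookie: " + "; ".join(f"{name}={value}" for name, value in cookies.items())


def loads_until_rejected(bytes_per_cookie: int, limit: int = HEADER_LIMIT_BYTES) -> int:
    """Count page loads until the Cookie header alone exceeds the limit,
    assuming (hypothetically) that every page load sets one new cookie."""
    cookies: dict[str, str] = {}
    loads = 0
    while len(cookie_header(cookies).encode()) <= limit:
        loads += 1
        cookies[f"c{loads}"] = "x" * bytes_per_cookie
    return loads


for size in (50, 100, 500):
    print(f"{size}-byte cookies: header too large after {loads_until_rejected(size)} page loads")
```

Once the header passes the limit, every later request carries the same oversized header and is refused, which is why clearing or blocking the cookies is the only thing that helps from the browser side.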
I've run into this problem as well. I used to use a Chrome extension I wrote that saved all of the pages I browsed to the Wayback Machine (in another tab), and I would frequently need to clear my cookies to keep the thing working.
https://en.wikipedia.org/wiki/Wikipedia:Link_rot
The Wikimedia Foundation's budget is about 10 times that of the Internet Archive. If you see the fundraising banner on Wikipedia and want to help out the site, but don't think the Wikimedia Foundation needs the donation, consider donating to the Internet Archive instead.
https://archive.org/donate