> “Thank goodness she did that because [otherwise] we would have no records of the early years of the first Women’s Hockey League in Canada,” Azzi said.
With the help of many industry partners, the [Canada Media Fund] CMF team unearthed Canadian gems buried in analog catalogues. Once discovered, we worked to secure permissions and required rights and collaborate with third parties to digitize the works, including an invaluable partnership with Deluxe Canada that covered 40 per cent of the digitization costs. The new, high-quality digital masters were made available to the rights holders and released to the public on the Encore+ YouTube channel in English and French.
In late 2022, the Encore+ channel deleted its entire YouTube archive of Canadian television with two weeks' notice. A few months later, half of the archive resurfaced on https://archive.org/search?query=creator%3A%22Encore%20%2B%2.... If anyone independently archived the missing Encore videos from YouTube, please mirror them to Archive.org.
I wrote a book in 2010. It had a references section with links to about 100 websites. When I wrote the second edition only about five years later, 50% of those links no longer worked.
What we're doing right now is borderline insane. We're putting all of this information on the web, but almost every individual bit of it depends on either a company or a human being keeping it online. It's inevitable that companies change their minds and humans die, so almost all of the information that is online right now will just disappear in the next 80 years.
And we essentially only have one single entity that tries to retain that information.
Better many copies than none. References usually mean written text and maybe some figures; the cost of storage keeps going down, so we can afford the duplication.
> everyone who references the same thing is then keeping a copy of it.
... and it would serve as a form of redundancy, imitating the fungible nature of physical media: in order to cite the latest copied manuscript (for example), you needed to own a physical copy. The existence of these copies has enabled the survival of works that would otherwise have been lost, or even their reconstruction through ecdotics.
> ... up until the point - if ever - where a sufficiently advanced solution for permanence is found and comes online?
Like the laser printer?
The cost of permanent, physical preservation is pennies. People just don't do it for most things. And it doesn't guarantee accessibility, which has hosting costs.
Two big issues with Archive.org are that 1. it's a single point of failure (they don't encourage mirror sites to emerge), and 2. they keep using the "brand" to fight unwinnable battles, like hosting books they don't own online, risking the whole endeavor.
I still appreciate it, but just imagine if it goes down due to a lawsuit. Now that Google no longer shows cached results, an entire historical record would be gone.
It's surprising that archive.org is the only such outfit I have encountered. We have had libraries since ancient times, so why are there so few digital libraries? There must be others, but nowhere near the number (or awareness) that we should have.
Heck, existing paper-based libraries should probably each include a digital archiving department.
Maybe this is already happening or already exists, and is trivial to those studying library science or something. I can hope, anyway.
Local neighborhood libraries could have their own curated digital archive, as cache for fast local search, and archival backup for long-term resilience.
> I still appreciate it, but just imagine if it goes down due to a lawsuit. Now that Google no longer shows cached results, an entire historical record would be gone.
Or somebody accidentally `rm -rf`'s an empty variable. Or The Big One hits San Fran. Or somebody in crisis breaks in with a crowbar, matchbook, and jug of gasoline.
They're a rather old-school shop. Own their own servers, all in one location I think. Bare metal admin stuff, and data's only mirrored across two disks per file IIRC. Keeps costs down. It's what makes the whole operation possible. But I also wonder sometimes.
Good thing I'm a hoarder. If I like something, I archive it and back it up locally. For example, a couple of days ago I needed some digital assets for an Adobe program that I had downloaded a few months earlier because I liked them and thought I might need them in the future. When I went back to the company's page, everything had vanished! I'm glad I had downloaded them beforehand and could retrieve them from my backup.
Many, many years ago I read a Fred Wilson (avc.com) blog post about a founder who vented that a tech journalist's article on the founder's company was a hit piece. The tech journalist had been recommended by Fred Wilson, who was an investor in the founder's company. Fred wrote that the tech journalist rarely did their own independent research, so the article's position must have come from the founder himself.
I can't find that article, and I have looked hard for it. I think it must have been taken down by now, and that is sad. It held a valuable lesson that I want to share with others these days.
Do you engage in any kind of physical archiving? I'm not going so far as to say it's superior to digital archiving but it does simplify maintenance of the assets in the event that you're incapable of it.
Only a few pictures are printed; the rest is all digital for the most part. Unless you mean tapes/DVDs/etc. as a physical archive, in which case, yeah, I have that too.
A nice social attack is to create an Internet Archive-looking website, call it archive.newtld, and use it to create social proof of things you didn't actually do. "Oh yeah, the Washington Post did a redesign, but here are my past 10 posts which I saved in the archive: link"
On the post-truth internet, proving archives are genuine is going to be tough, and unless there's some other form of verification, they are going to become useless for "proving" purposes fast.
You can think bigger and do this to forge stories about anything you want on any website. Nobody checks the authenticity of archive URLs, there are already several such sites, and a lot of these services do URL rewriting, so it's hard unless there's some authoritative source.
Yes, pretty easily. In general, actively faking old data, and especially faking it to convince the public, is not what should be our main concern. Any given archive site can try that once, with a high risk of quickly being caught.
Keybase would have been a good solution: essentially a GUI for a PGP keyserver, but with social proofs like "the owner of this account also controls the domain washingtonpost.com (as evidenced by publishing the fingerprint as a TXT record)". Too bad about the Zoom acquisition; development has stalled since then.
I also don't know exactly what bag of bytes you would be checking the signature against; the raw normalized text of a story? Maybe if news articles were distributed as PDFs this would be a solved problem, but news sites actually don't want their content to be portable; they want you to sign in to their domain to view anything.
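For what it's worth, the TXT-record style of proof is easy to check mechanically. A minimal sketch using dnspython (matching a raw fingerprint substring is my own simplification; Keybase's actual proof format was more involved):

    import dns.resolver  # dnspython

    # Check whether a domain publishes a given key fingerprint in its TXT
    # records, i.e. the "controls the domain" style of social proof.
    def domain_claims_fingerprint(domain: str, fingerprint: str) -> bool:
        try:
            answers = dns.resolver.resolve(domain, "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return False
        for rdata in answers:
            txt = b"".join(rdata.strings).decode("utf-8", "replace")
            if fingerprint.lower() in txt.lower():
                return True
        return False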
If you know you might want to prove something's authenticity later, post the SHA256 somewhere now, e.g. multiple social media sites, large and trusted web archives, or cryptocurrency blockchains. The last one is a stronger proof, with lots of money making sure it stays immutable.
Or hash everything you produce/consume. Then hash the list of hashes, and post that.
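A minimal sketch of that in Python (the archive directory and the manifest format are just placeholders):

    import hashlib
    from pathlib import Path

    def sha256_file(path: Path) -> str:
        """Hex SHA-256 digest of a file, read in chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hash everything under ./archive, then hash the sorted manifest of
    # "<digest>  <relative path>" lines. Posting only the final digest
    # publicly is enough to later show any individual file was in the set
    # at that time (by republishing the manifest).
    files = [p for p in sorted(Path("archive").rglob("*")) if p.is_file()]
    manifest = "\n".join(f"{sha256_file(p)}  {p.as_posix()}" for p in files)
    print(hashlib.sha256(manifest.encode()).hexdigest())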
Or alternatively, counter forgeries by capturing more data. For the web, but for all data in general. Sensor RAWs to prove an image isn't "AI". Browser and network stack RAM dumps to prove a website's authenticity. Etc. There are what, a couple dozen accelerometers, GPSs, LIDARs, etc. on the latest iPhones?
I'm not a particularly good writer, but I've written about how I use the SingleFile extension to capture a perma web version of everything interesting that I read[0]. It's a great open source tool that aids in archiving (even if only at the personal level).
I've been taking notes and blogging since the early 2000s, and I keep coming back to find that the content I'd linked to has disappeared.
Archive.org and Archive Team do amazing work, but it's a mistake to put all your archiving eggs in one basket.
I think both work, from a purely information point of view.
The SingleFile download preserves more of the original format. For a long while I was using MarkDownload and capturing the content that way, but a bunch gets lost.
I also use Zotero for downloading journal articles (etc.), and it has the ability to snapshot too, but I found the snapshots were locked up inside Zotero. My current setup is a Jekyll repo on Vercel, which means the content is accessible almost immediately after the GitHub push and deploy, something that happens automatically after I click the SingleFile download button (configured in the extension).
I need do no more than grab the web link and paste it into Obsidian, whereas linking to Zotero from Obsidian is a royal pain (not impossible, but painful).
One of the great ironies of this situation is that many of the now-defunct websites had contracts and writing agreements that were absolutely egregious. Often the boilerplate said that they owned the article (which they paid you a pittance for) until the end of all time.
Before that, in the print era, the standard agreement was that they had the rights to your story upon publication, and after a reasonable amount of time the rights would revert to the author.
Too bad there's not something like DOI or ARK [0] available for anyone to use to give documents a searchable, permanent ID so that a location can be maintained. IME, the half-life of many URLs (5-10 years?) makes them unreliable. I recently was unable to find (by URL) an entire historical collection at a major southern US university until I discovered that it had been moved to a new server.
> the half-life of many URLs (5-10 years?) makes them unreliable.
"Simple" enough experiment, on this very site: Use the "past" feature on the main menu to go back a step at a time, and tally the number of broken links from the external submissions the further you go back.-
The number of dead projects, expired domains, broken links, 404s, etc. is sad.
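A rough sketch of automating that tally against the public hn.algolia.com search API (the time window, timeouts, and what counts as "broken" are arbitrary choices, and some live sites reject HEAD requests):

    import requests

    # Fetch up to 100 stories submitted between two Unix timestamps and
    # report the share whose submitted URLs no longer resolve cleanly.
    def broken_link_ratio(start_ts: int, end_ts: int) -> float:
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search_by_date",
            params={
                "tags": "story",
                "numericFilters": f"created_at_i>{start_ts},created_at_i<{end_ts}",
                "hitsPerPage": 100,
            },
            timeout=30,
        )
        urls = [hit["url"] for hit in resp.json()["hits"] if hit.get("url")]
        broken = 0
        for url in urls:
            try:
                r = requests.head(url, allow_redirects=True, timeout=10)
                if r.status_code >= 400:
                    broken += 1
            except requests.RequestException:
                broken += 1
        return broken / len(urls) if urls else 0.0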
I fully support the efforts, but are there not legal problems with this? (No, I don't think legal issues should prevent this.)
If I worked for CorporateMediaNews as a columnist and reporter for 10 years and they decided to remove all of it, does CMN not own the work, and can't they (unfortunately) dispose of it if they so wish? Would I not have any rights to the work?
Thinking about my own career: I have written a hell of a lot of code, and at least 80% of it is closed-source systems for various companies. I don't retain any copies of that code. It would be interesting if I heard that System X, which I wrote 15 years ago, was being shut down, and I would try to obtain the source code in order to preserve it. I have never heard of anyone doing that, but it probably happens more often in games and the like.
Journalists are often judged based on their public output. You get bylines on published articles, and you can use those to apply for jobs. There's not much emphasis on secrecy because the goal of a news outlet is to be read by as many people as possible.
Being unable to provide examples of your work is a career-killer if everyone else can point to stuff they wrote. Preserving your work isn't altruistic, it's required.
It's a very different environment from programming, where oftentimes the source code is valuable precisely because it's secret. Employees taking source code is dangerous because competitors can copy it; obviously that's bad for one's career.
The analogy doesn't work here because the cost/benefits of taking legal action are different.
A ton of writing is done for (sometimes very expensive) paywalled and subscription-only publications, or is internal-only at companies. Theoretically all my (pre-blogging) analyst work is in that category. That said, no reasonable person cares much if you display some several-year-old thing you wrote for someone (absent confidential or classified information). I put "greatest hits" on my website, and everything is either from a company that no longer exists or is years-old material from a news website. The situation isn't ideal, but it works better than the letter of the law suggests it should.
It's often probably at least a bit complicated. I cross-posted material between a couple of organizations (one of which is long gone) over a number of years via pretty much informal agreement. I also reused a fair bit of that material for other purposes. Everyone was OK with the state of affairs but who actually held the copyright? Who knows and I was certainly never going to bring it to a head.
As a practical matter if CorporateMediaNews or the like don't care about something any longer, they mostly don't care if someone else makes use of it so long as it isn't embarrassing or misrepresenting the organization.
In the end, I have created work that I've reused for a variety of organizations as well as independently, and that someone other than myself would probably claim copyright to, but it's often been pretty loose.
Random old source code is probably pretty hard to do something useful with. But, honestly, random old work product is pretty hard for just about anyone to do something useful with. And even for something self-contained (like writing), most of the stuff I wrote 10 or 20 years ago is pretty uninteresting, even if a few things have some historical interest.
So what if there were a hub where a reporter could send the URL of their article when it's published, and the hub then saves that page as text (lynx -dump or whatever) if it's not paywalled? That would be OK, I guess, up until the hub makes it accessible. Or would it be OK, since it was publicly available on the net at one time, if the hub only publishes when the original URL goes dark?
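A minimal sketch of what the hub's save step might look like, assuming lynx is installed (the storage layout and file naming are just illustrative):

    import hashlib
    import subprocess
    import time
    from pathlib import Path

    ARCHIVE_DIR = Path("hub-archive")

    # Save a plain-text dump of a submitted article URL, keyed by a hash of
    # the URL, along with the URL itself and a capture timestamp.
    def archive_url(url: str) -> Path:
        text = subprocess.run(
            ["lynx", "-dump", "-nolist", url],
            capture_output=True, text=True, check=True, timeout=60,
        ).stdout
        key = hashlib.sha256(url.encode()).hexdigest()[:16]
        ARCHIVE_DIR.mkdir(exist_ok=True)
        out = ARCHIVE_DIR / f"{key}.txt"
        out.write_text(f"{url}\n{int(time.time())}\n\n{text}")
        return out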
Thank you for sharing, but that is not news; that's everything, and therefore not searchable. It will have the reporter's article plus every webpage that quoted, interpreted, summarized, and commented on it.
My understanding is that some photographers are archiving their digital pictures by basically printing them using the 4-color process, which gives them the 4 "negatives" ("positives"? or whatever they're called), one for each color (CMYK, I guess).
Those sheets are archival quality and should last for quite some time, given a proper storage environment.
They can always use those later to have them scanned back in should they lose their master digital files.
This is the only way, in my opinion. Not just for journalists, but for all professions. If you haven't archived it yourself, on machines and/or media that you are in possession of, then you can't rely on it to continue to persist.
This has been true for a long time. Had I not archived a fair bit of my own work, some of it in the CMSs of dead organizations, some of it inaccessible behind paywalls, much would no longer exist. Journalists are probably in better shape than many because they're more likely to have work they've created on a relatively open web.
Archive.org can and does accept takedown requests, even if the requester wants to avoid public scrutiny. If you're writing about a contentious topic and want to preserve your links (tweets or whatever), there are better options.
Is there a tool that I can script to use the cookies from my existing web browser (because I'm logged into some websites) and get the page content as text, all while clicking away pop-up banners (newsletter, cookies, etc.)?
I would script that to go over my local bookmarks file.
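Something like this rough sketch is what I'm imagining; browser_cookie3 and readability-lxml are just example libraries, and I haven't tested it:

    import browser_cookie3              # reads cookies from a local browser profile
    import requests
    from bs4 import BeautifulSoup
    from readability import Document    # readability-lxml

    # Fetch a page with Firefox's cookies (so logged-in sites work), then
    # reduce it to the readable article text, dropping banners and chrome.
    def page_text(url: str) -> str:
        cookies = browser_cookie3.firefox()
        html = requests.get(url, cookies=cookies, timeout=30).text
        main_html = Document(html).summary()   # main content as HTML
        return BeautifulSoup(main_html, "html.parser").get_text("\n", strip=True)

    # Then loop this over every URL pulled out of an exported bookmarks file.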
Thank you. I should have been clearer and mentioned that I'm on Debian ))
But in any case, Safari's reader mode can be scripted from the CLI? That is good to know, maybe I'll try to find something similar for Firefox's Reader Mode. Thank you.
Not all of these are 1:1 replacements, but here are a few options: archive.is/archive.today, GhostArchive, saving a webpage as HTML with its assets, and taking a screenshot.
Ever since the NYT legal case against OpenAI (pronounce: ClosedASI, not FossAGI; free as in your data for them, not free as in beer), there seems to be an underground current pulling into a riptide of closed information access on the web. Humorously enough, the zimmit project has been quietly updating the living heck out of itself, awakening from a nearly 6-8 year slumber. The once-simple format for making a MediaWiki offline archive can now mirror any website, complete with content such as video, PDFs, and other files.
It feels a lot like the end of Usenet or GeoCities, but this time without the incentive for the archivists to share their collections as openly. I am certain full scrapes of Reddit and Twitter exist, even after the API closures, but we will likely never see these leave large AI companies' internal data holdings.
I have taken it upon myself to begin using the updated zimmit docker container to start archiving swaths of the 'useful web', meaning not just high quality language tokens, but high quality citations and knowledge built with sources that are not just links to other places online.
I started saving all my starred GitHub repos into a folder, and it came out to just around 125 GB of code.
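The saving itself can be something as simple as this sketch (the username and destination are placeholders; unauthenticated GitHub API calls are rate-limited):

    import subprocess
    from pathlib import Path

    import requests

    GITHUB_USER = "your-username"       # placeholder
    DEST = Path("starred-mirror")

    # Page through the user's starred repos and mirror-clone each one.
    def starred_clone_urls(user: str) -> list[str]:
        urls, page = [], 1
        while True:
            batch = requests.get(
                f"https://api.github.com/users/{user}/starred",
                params={"per_page": 100, "page": page},
                timeout=30,
            ).json()
            if not batch:
                return urls
            urls += [repo["clone_url"] for repo in batch]
            page += 1

    DEST.mkdir(exist_ok=True)
    for url in starred_clone_urls(GITHUB_USER):
        name = url.rstrip("/").removesuffix(".git").rsplit("/", 1)[-1]
        target = DEST / name
        if not target.exists():
            subprocess.run(["git", "clone", "--mirror", url, str(target)], check=True)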
I am terrified that in the very near future a lot of this content will either become paywalled, or the cost of hosting large information repositories will climb past what current ad-revenue-based models can support, as ever larger and more powerful scraping operations seek to fill their petabytes while I try to keep the few small TB of content I don't want to lose from slipping through my fingers.
If anyone actually cares deeply about content preservation, go and buy yourself a few 10+ TB external disks, grab a copy of zimmit, and start pulling stuff. Put it on archive.org and tag it. So far the only ZIM files I see on archive.org are the ones publicly released by the Kiwix team, yet there is an entire wiki of wikis called WikiIndex that remains almost completely unscraped. Fandom and Wikia are gigantic repositories of information, and I fear they will close themselves up sooner rather than later, while many of the smaller info stores we have all come to take for granted as being "at our fingertips" will slowly slip away.
I first noticed the deep web deepening when things I used to be able to find on Google no longer showed up, no matter how well I knew the content I was searching for and no matter the complex dorking I attempted with operators in the search bar; it was as if they had vanished. For a time Bing was excellent at finding these "scrubbed" sites. Then DuckDuckGo entered the chat, and Bing started to close itself down more. Bing was just a scrape of Google, and Google stopped being reliable, so downstream "search indexers" just became micro-Googles that were slightly out of date with slightly worse search accuracy, and those ghost pages were now being "anti-propagated" into these downstream indexers.
Yandex became and is still my preferred search engine when I actually need to find something online, especially when using operators to narrow wide pools.
I have found some rough edges with zimmit, and I am planning to investigate and even submit some PRs upstream. But when an archive attempt takes three days to run before crashing and wiping out its progress, it has been hard to debug without the FOMO hitting: I should be spending the time grabbing what I can now, and come back later to work on the code and do everything properly.
If anyone has the time to commit to the project and help make it more stable, perhaps by working on fault recovery or failure continuation, it would make archivists like me who are strapped for time very, very happy.
Please go and make a dent in this; news is not the only part of the web I feel could be lost forever if we do not act to preserve it.
In five years' time I see generic web search being considered legacy software and eventually decommissioned in favor of AI-native conversational search (blow my brains out). I know for a fact all AI companies are doing massive data collection and structuring for GraphRAG-style operations; my fear is that when it's working well enough, search will just vanish until a group of hobbyists makes it available to us again.
A few years ago, Canada digitized many older television shows, https://news.ycombinator.com/item?id=35716982
In late 2022, the Encore+ channel deleted its entire YouTube archive of Canadian television with two weeks' notice. A few months later, half of the archive resurfaced on https://archive.org/search?query=creator%3A%22Encore%20%2B%2.... If anyone independently archived the missing Encore videos from YouTube, please mirror them to Archive.org.