To preserve their work journalists take archiving into their own hands (niemanlab.org)
159 points by bcta1 5 months ago | 91 comments



> “Thank goodness she did that because [otherwise] we would have no records of the early years of the first Women’s Hockey League in Canada,” Azzi said.

A few years ago, Canada digitized many older television shows, https://news.ycombinator.com/item?id=35716982

  With the help of many industry partners, the [Canada Media Fund] CMF team unearthed Canadian gems buried in analog catalogues. Once discovered, we worked to secure permissions and required rights and collaborate with third parties to digitize the works, including an invaluable partnership with Deluxe Canada that covered 40 per cent of the digitization costs. The new, high-quality digital masters were made available to the rights holders and released to the public on the Encore+ YouTube channel in English and French.
In late 2022, the channel deleted the entire YouTube Encore archive of Canadian television with two weeks' notice. A few months later, half of the archive resurfaced on https://archive.org/search?query=creator%3A%22Encore%20%2B%2.... If anyone independently archived the missing Encore videos from YouTube, please mirror them to Archive.org.


Archive.org is such a godsend.-

The entire information 'substrate' of society is ephemeral, if digital, and no one (or at least not enough of us) seems to have noticed.-


I wrote a book in 2010. It had a references section with links to about 100 websites. When I wrote the second edition only about five years later, 50% of those links no longer worked.

What we're doing right now is borderline insane. We're putting all of this information on the web, but almost every individual bit of information depends on either a company or a human being keeping it online. It's inevitable that companies change their minds, and humans die, so almost all of the information that is online right now will just disappear in the next 80 years.

And we essentially only have one single entity that tries to retain that information.


> references section with links to about 100 websites.

Books deserve a github repo with PDF web archives of referenced links, the same way that Wikipedia mirrors the content of cited links.
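For the snapshots themselves, something as simple as a headless browser print loop would do. A rough sketch, assuming a Chromium binary and a urls.txt with one reference URL per line (file names are made up):

  # Snapshot each referenced URL as a PDF into a references/ folder for the repo.
  mkdir -p references
  n=0
  while read -r url; do
    n=$((n + 1))
    chromium --headless --disable-gpu --print-to-pdf="references/ref-$n.pdf" "$url"
  done < urls.txt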


But wouldn't that be a big waste if everyone who references the same thing then keeps their own copy of it?


> big waste

Storage cost has fallen exponentially for decades, https://ourworldindata.org/data-insights/the-price-of-comput...


Redundancy isn't really a waste.


Better many copies than none. References usually mean written text and maybe some figures; the cost of storage keeps going down, so we can afford the duplication.


> everyone who references the same thing is then keeping a copy of it.

... and it would serve as a form of redundancy, imitating the fungible nature of physical media: in order to cite the latest copied manuscript (for example) you needed to own a physical copy. The existence of these copies has enabled the survival of works that would otherwise have been lost, or even their reconstruction through ecdotics.-


One person’s waste is another person’s resilience.


There are all kinds of publisher and legal issues. Trust me, I did my best.


Could Wikipedia or Archive.org offer references-as-a-service to book publishers for a small fee? They already have the infrastructure and legal cover.


certainly not GitHub


What would you recommend instead?


Copyright permitting, a big QR code containing the plain text.
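A minimal sketch, assuming the qrencode tool is installed; a single QR code tops out around 3 KB, so longer texts would need to be split across several codes:

  # Encode a short plain-text reference as a QR code image.
  qrencode -o reference.png < reference.txt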


That entity goes out of its way to hide information if you're friends with the owners.


All the more reason not to rely on it. It's hard to complain, though, when no one else is willing to do what they do.


> wrote a book in 2010. It had a references section

Out of curiosity, may I ask what it was about?


> so almost all of the information that is online right now will just disappear in the next 80 years.

> And we essentially only have one single entity that tries to retain that information.

Will future ages find ours a dark age, a gap in their records, a void ...

... up until the point - if ever - where a sufficiently advanced solution for permanence is found and comes online?


> ... up until the point - if ever - where a sufficiently advanced solution for permanence is found and comes online?

Like the laser printer?

The cost of permanent, physical preservation is pennies. People just don't do it for most things. And it doesn't guarantee accessibility, which has hosting costs.


> Like the laser printer?

Sure. Whatever works.-

But I meant one that is systematically and systemically and widely used.-


> Will future ages find ours a dark age, a gap in their records, a void

I think this is a very likely future, yes.


Grim, indeed.-


Two big issues with Archive.org are that 1. it's a single point of failure, and they don't encourage mirror sites to emerge, and 2. they keep using the "brand" to fight unwinnable battles, like hosting books they don't own online, risking the whole endeavor.

I still appreciate it, but just imagine if it goes down due to a lawsuit. Now that Google no longer shows cached results, an entire historical record would be gone.


It's surprising that archive.org is the only such outfit I have encountered. Just as we have had libraries since ancient times, why are there so few digital libraries? There must be others, but nowhere near the number (or awareness) that we should have.

Heck, existing paper-based libraries should probably each include a digital archiving department.

Maybe this is already happening or already exists, and is trivial to those studying library science or something. I can hope, anyway.


There are lots of web archiving projects out there:

https://en.wikipedia.org/wiki/List_of_Web_archiving_initiati...

But the web is large. And public sector or academic librarian teams tend to be small. The IA's the one that people have coalesced around.


Excellent question.

Local neighborhood libraries could have their own curated digital archive, as cache for fast local search, and archival backup for long-term resilience.


> I still appreciate it, but just imagine if it goes down due to a lawsuit. Now that Google no longer shows cached results, an entire historical record would be gone.

Or somebody accidentally `rm -rf`'s an empty variable. Or The Big One hits San Fran. Or somebody in crisis breaks in with a crowbar, matchbook, and jug of gasoline.
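That classic empty-variable footgun, for anyone who hasn't met it, looks roughly like this (a generic illustration, not their actual setup):

  # If ARCHIVE_DIR is unset or empty, the glob expands to /* and deletes far too much.
  rm -rf "$ARCHIVE_DIR"/*

  # Safer: abort if the variable is unset or empty.
  rm -rf "${ARCHIVE_DIR:?ARCHIVE_DIR is not set}"/*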

They're a rather old-school shop. Own their own servers, all in one location I think. Bare metal admin stuff, and data's only mirrored across two disks per file IIRC. Keeps costs down. It's what makes the whole operation possible. But I also wonder sometimes.



If you check the issues, you'll learn this is not a supported project anymore (and honestly, it hardly worked even back then).


> Now that Google no longer shows cached results,

That was also the "end of an era" of sorts right there.-

> they don't encourage mirror sites to emerge,

Something over BitTorrent or blockchain would work well here, methinks. As a baseline substrate.-


Good thing I'm a hoarder. If I like something, I archive it and back it up locally. For example, a couple of days ago I needed some digital assets for an Adobe program, assets I had downloaded a few months earlier because I liked them and thought I might need them in the future. When I went back to the company's page, everything had vanished! I'm glad I had downloaded them beforehand and could retrieve them from my backup.


Many, many years ago I read a Fred Wilson (avc.com) blog post about a founder who vented that a tech journalist's article on the founder's company was a hit piece. The tech journalist had been recommended by Fred Wilson, who was an investor in the founder's company. Fred wrote that the tech journalist rarely does their own independent research; the article's position must have come from the founder himself.

I can't find that article, and I have looked hard for it. I think it must have been taken down by now, and that is sad. It had a valuable lesson that I want to share with others these days.


A lot of the most interesting things are contentious and likely to be deleted. It's almost a law of the internet.


Do you engage in any kind of physical archiving? I'm not going so far as to say it's superior to digital archiving, but it does simplify maintenance of the assets in the event that you're no longer able to do it yourself.


Only a few pictures, which I have printed; the rest is all digital for the most part. Unless you mean tapes/DVDs/etc. as a physical archive, in which case, yeah, I have that too.


A nice social attack is to create an Internet Archive-looking website, call it archive.newtld, and use it to create social proof of things you didn't actually do. "Oh yeah, the Washington Post did a redesign, but here are my past 10 posts which I saved in the archive: link"

On the post-truth internet, proving archives are genuine is going to be tough, and unless there's some other form of verification they're going to become useless for "proving" purposes fast.

You can think bigger and do this to forge stories about anything you want on any website. Nobody checks the authenticity of archive URLs, there are already several such sites, and a lot of these services do URL rewriting, so it's hard unless there's some authoritative source.


I agree. Archives shutting down creates a huge risk for what I call "historical context attacks" in the future, similar to what you're describing.

It's about generating false historical context to support falsified digital artifacts.

I wrote an article about it for Fast Company in 2020:

https://www.fastcompany.com/90549441/how-to-prevent-deepfake...


Could this be solved by digital signatures on web content? (Or, a way to store those)


I designed it back in 2017. Called it the “permissionless timestamping network”.

https://intercoin.org/technology.pdf

You’d have to store the actual content. The network would just store the hash.

It’s just a Merkle DAG. The innovation is all about forcing nodes to timestamp everything if they timestamp anything.

Blockchains are overkill for this. Blockchains are for when content that was timestamped changes. Not when it just accumulates.
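Not the actual design, but a toy shell illustration of the underlying idea: store only hashes, and chain each timestamped entry to the previous one so the history can't be silently rewritten:

  # Toy hash-chain timestamp log (illustration only).
  # Each entry commits to the previous entry's hash, the content hash, and a timestamp.
  log=timestamps.log
  prev=$(tail -n 1 "$log" 2>/dev/null | cut -d' ' -f1)
  content=$(sha256sum article.html | cut -d' ' -f1)
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  entry=$(printf '%s %s %s' "$prev" "$content" "$ts" | sha256sum | cut -d' ' -f1)
  echo "$entry $content $ts" >> "$log"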

Also, I wrote this back in 2017 as an aspirational roadmap: https://github.com/Qbix/architecture/wiki/Internet-2.0


Yes, pretty easily. In general, actively faking old data, and especially faking it to convince the public, is not what should be our main concern. Any given archive site can try that once, with a high risk of quickly being caught.


How do you obtain the key to check the signature?


Keybase would have been a good solution: essentially a GUI for a PGP keyserver, but with social proofs like "the owner of this account also controls the domain washingtonpost.com (as evidenced by publishing the fingerprint as a TXT record)". Too bad about the Zoom acquisition; development has stalled since then.

I also don't know exactly what bag of bytes you would be checking the signature against: the raw normalized text of a story? Maybe if news articles were distributed as PDFs this would be a solved problem, but news sites don't actually want their content to be portable; they want you to sign in to their domain to view anything.
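If a publisher did ship the normalized text plus a detached signature, the check itself would be trivial; a sketch with openssl (file names are hypothetical), which still leaves the key-distribution problem above:

  # Publisher side: sign the normalized article text with a private key.
  openssl dgst -sha256 -sign publisher_key.pem -out article.sig article.txt

  # Reader side: verify against the publisher's published public key.
  openssl dgst -sha256 -verify publisher_pub.pem -signature article.sig article.txt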


If you know you might want to prove something's authenticity later, post the SHA256 somewhere now: multiple social media sites, large and trusted web archives, cryptocurrency blockchains. The last one is a stronger proof, with lots of money making sure it stays immutable.

Or hash everything you produce/consume. Then hash the list of hashes, and post that.
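A minimal sketch of the list-of-hashes approach with standard tools (paths are made up):

  # Hash everything in the archive, then hash the sorted list of hashes.
  # Post only the final digest publicly; keep hashes.txt to prove individual files later.
  find archive/ -type f -exec sha256sum {} + | sort > hashes.txt
  sha256sum hashes.txt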

Or, alternatively, counter forgeries by capturing more data. For the web, but for all data in general. Sensor RAWs to prove an image isn't "AI". Browser and network stack RAM dumps to prove a website's authenticity. Etc. There's what, a couple dozen accelerometers, GPSes, LIDARs, etc. on the latest iPhones?


A content addressed network, yes.



I'm not a particularly good writer, but I've written about how I use the SingleFile extension to capture a perma web version of everything interesting that I read[0]. It's a great open source tool that aids in archiving (even if only at the personal level).

I've been taking notes and blogging since the early 2000s, and I keep coming back to find that the content I'd linked to has disappeared.

Archive.org and Archive Team do amazing work, but it's a mistake to put all your archiving eggs in one basket.

[0]: https://vertis.io/2024/01/26/how-singlefile-transformed-my-o...


How do you feel about this vs printing a pdf of the content?


I think both work, from a purely information point of view.

The SingleFile download preserves more of the original format. For a long while I was using MarkDownload and capturing the content that way, but a bunch is lost that way.

I also use Zotero for downloading journal articles (etc.); it can snapshot pages too, but I found the content ended up locked inside Zotero. My current setup, by contrast, is a Jekyll repo on Vercel, which means the content is accessible almost immediately after the GitHub push and deploy, something that happens automatically after I click the SingleFile download button (configured in the extension).

I need do no more than grab the web link and paste it into Obsidian, whereas linking to Zotero from Obsidian is a royal pain (not impossible, but painful).


One of the great ironies of this situation is that many of the now-defunct websites had contracts and writing agreements that were absolutely egregious. Often the boilerplate would say that they owned the article (which they paid you a pittance for) until the end of all time.

Previously, in the print era, the standard agreement was that they'd have the rights to your story upon publication, and after a reasonable amount of time the rights would revert to the author.


Too bad there's not something like DOI or ARK [0] available for anyone to use to give documents a searchable, permanent ID so that a location can be maintained. IME, the half-life of many URLs (5-10 years?) makes them unreliable. I recently was unable to find (by URL) an entire historical collection at a major southern US university until I discovered that it had been moved to a new server.

[0] https://arks.org/about/


> the half-life of many URLs (5-10 years?) makes them unreliable.

"Simple" enough experiment, on this very site: Use the "past" feature on the main menu to go back a step at a time, and tally the number of broken links from the external submissions the further you go back.-

The amount of dead projects, expired domains, broken links, 404s, etc. is sad.-


Someone did that and found that only 5% of submitted URLs (200k/4M) were dead: https://blog.wilsonl.in/hackerverse/

Submitted 3 months ago: https://news.ycombinator.com/item?id=40307519


Thanks for the data. That's actually not bad. Then again, 5% of a "webscale" number of sites is still a lot.-

Thanks for pointing to that study ...


I fully support the efforts, but are there not legal problems with this? (No, I don't think legal issues should prevent it.)

If I worked for CorporateMediaNews as a columnist and reporter for 10 years and they decided to remove all of it, doesn't CMN own the work, and can't they (unfortunately) dispose of it if they so wish? Wouldn't I have no rights to the work?

Thinking about my own career: I have written a hell of a lot of code, and at least 80% of it is closed-source systems for various companies. I don't retain any copies of that code.

It would be interesting if I heard that System X, which I wrote 15 years ago, was being shut down, and I tried to obtain the source code in order to preserve it. I have never heard of anyone doing that, but it probably happens more often in games and such.


Journalists are often judged based on their public output. You get bylines on published articles, and you can use those to apply for jobs. There's not much emphasis on secrecy because the goal of a news outlet is to be read by as many people as possible.

Being unable to provide examples of your work is a career-killer if everyone else can point to stuff they wrote. Preserving your work isn't altruistic, it's required.

It's a very different environment compared to programming, because oftentimes the source code is valuable because it's a secret. Employees taking source code is dangerous because competitors can copy it; obviously that's bad for one's career.

The analogy doesn't work here because the cost/benefits of taking legal action are different.


A ton of writing is done for (sometimes very expensive) paywalled and subscription-only publications--or internal-only at companies. Theoretically all my (pre-blogging) analyst work is in that category. That said, no one reasonable cares much if you display that several-year-old thing you wrote for someone (absent confidential or classified information). I put "greatest hits" on my website, and everything is either from a company that no longer exists or is years-old material from a news website. The situation isn't ideal, but it works better than the letter of the law suggests it should.


It's often probably at least a bit complicated. I cross-posted material between a couple of organizations (one of which is long gone) over a number of years via pretty much informal agreement. I also reused a fair bit of that material for other purposes. Everyone was OK with the state of affairs but who actually held the copyright? Who knows and I was certainly never going to bring it to a head.

As a practical matter if CorporateMediaNews or the like don't care about something any longer, they mostly don't care if someone else makes use of it so long as it isn't embarrassing or misrepresenting the organization.

In the end, I have created work that I've reused for a variety of organizations as well as independently, work that someone other than myself could probably claim copyright to, but it's often been pretty loose.


Random old source code is probably pretty hard to do something useful with. But, honestly, random old work product is pretty hard for just about anyone to do something useful with. And even for something self-contained (like writing), most of the stuff I wrote 10 or 20 years ago is pretty uninteresting, even if a few things have some historical interest.


Doesn't that happen (if not too often) with "abandonware" turned open source?


It rarely turns open source. Someone releases it and no one cares enough to try to do anything about it.


So what if there were a hub where a reporter could send the URL of their article when it's published, and the hub then saved that page as text (lynx -dump or whatever) if it's not paywalled? That would be OK, I guess, until the hub makes it accessible. Or would it be OK, since it was publicly available on the net at one time, if the hub only published once the original URL went dark?
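The ingest step could be almost nothing; a hypothetical sketch (the access/publication question is the hard part):

  # Store a plain-text dump of the submitted URL, keyed by date and a hash of the URL.
  url="$1"
  mkdir -p archive
  stamp=$(date -u +%Y%m%d)
  id=$(printf '%s' "$url" | sha256sum | cut -c1-16)
  lynx -dump -nolist "$url" > "archive/${stamp}-${id}.txt"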


- https://archive.today (works with some paywall content)

- https://web.archive.org

etc


Thank you for sharing, but that is not just news, that's everything, and therefore not usefully searchable. It will have the reporter's article and every webpage that quoted, interpreted, summarized, and commented on it.


My understanding is that some photographers are archiving their digital pictures by basically printing them using the 4-color process, which gives them the 4 "negatives" ("positives"? or whatever they're called), one for each color (CMYK, I guess).

Those sheets are archival quality and should last for quite some time, given a proper storage environment.

They can always use those later to have them scanned back in should they lose their master digital files.


With cheap 20+ TB drives available, just buy a new drive every couple of years and copy the files forward.
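The copy-forward step can be as simple as an rsync that compares file contents rather than timestamps (mount points are made up):

  # Copy the archive forward onto the new drive, verifying content along the way.
  rsync -a --checksum --progress /mnt/old-drive/archive/ /mnt/new-drive/archive/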


PAR archives FTW :)
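For anyone who hasn't used them, roughly (assuming the par2 command-line tool; the 10% redundancy figure is arbitrary):

  # Create ~10% parity data alongside the files, then verify (or repair) later if bits rot.
  cd photos
  par2 create -r10 photos.par2 *.tif
  par2 verify photos.par2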



This is the only way, in my opinion. Not just for journalists, but for all professions. If you haven't archived it yourself, on machines and/or media that you are in possession of, then you can't rely on it to continue to persist.


This has been true for a long time. Had I not archived a fair bit of my own work, some of it in the CMSs of dead organizations, some of it inaccessible behind paywalls, much would no longer exist. Journalists are probably in better shape than many because they're more likely to have work they've created on a relatively open web.


It's great to see more non-programmers realize how ephemeral Web content is and take up bare-bones archiving efforts.

If you or someone you know are looking to archive content from the Web, but don't know how, I'll be happy to help. My email is in my profile.


Someone should create an immutable article hub repository. They can call it pubark.


Journos discover backups!

Yay.


Personal backups are one thing. Creating a permanent (whatever that means) record is another.


I used to work in a helicopter factory. My company now holds records for some very old aircraft.

I'm well aware about bit rot.


They should really collaborate with archive.org. They won't shut things down or paywall it.


Archive.org can and does accept takedown requests, even if the requester wants to avoid public scrutiny. If you're writing about a contentious topic and want to preserve your links (tweets or whatever), there are better options.


What are some of those options?


There's an add-on called SingleFile; you can use it to save the whole page with assets (except videos).

https://github.com/gildas-lormeau/SingleFile


Is there a tool that I can script to use the cookies from my existing web browser (because I'm logged into some websites) and get the page's content text? All while clicking away pop-up banners (newsletter, cookies, etc.)?

I would script that to go over my local bookmarks file.


Reader mode does this in Safari. Adding to Reading List also stores the content for offline reading.

I bookmark to a service (Linkding, self-hosted) that automatically sends the URL to be archived at Wayback Machine.


Thank you. I should have been clearer and mentioned that I'm on Debian ))

But in any case, Safari's reader mode can be scripted from the CLI? That is good to know, maybe I'll try to find something similar for Firefox's Reader Mode. Thank you.


Yes. On macOS we would generally use osascript on the command line. Apple Script Editor shows which functions each app exposes. Very powerful combo.
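For example, a minimal sketch (assuming Safari's scripting dictionary, which exposes the front document's URL and name):

  # Grab the URL and title of Safari's front tab from the shell.
  osascript -e 'tell application "Safari" to get URL of front document'
  osascript -e 'tell application "Safari" to get name of front document'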


That's great to know. I consider an Apple device every few years when I upgrade my current devices; osascript sounds compelling.


Safari has this: it can save a page as a Web Archive, which is a sort of zip containing the HTML and assets.



Not all of these are 1:1 replacements, but here are a few options: archive.is/archive.today, Ghost Archive, saving a webpage as HTML with assets, and taking a screenshot.


There really aren't. You can publish a book I guess but that has limited reach. And your website isn't going to be around forever.



Ever since the NYT legal case against OpenAI (pronounce: ClosedASI, not FossAGI; free as in your data for them, not free as in beer), there seems to be an underground current pulling into a riptide of closed information access on the web. Humorously enough, the zimmit project has been quietly updating the living heck out of itself, awakening from a nearly 6-8 year slumber. The once simple format for making a MediaWiki offline archive is now able to mirror any website, complete with content such as video, PDFs, or other files.

It feels a lot like the end of Usenet or GeoCities, but this time without the incentive for the archivists to share their collections as openly. I am certain full scrapes of Reddit and Twitter exist, even after the API closures, but we will likely never see these leave large AI companies' internal data holdings.

I have taken it upon myself to begin using the updated zimmit docker container to start archiving swaths of the 'useful web', meaning not just high quality language tokens, but high quality citations and knowledge built with sources that are not just links to other places online.

I started saving all my starred GitHub repos into a folder, and it came out to just around 125 GB of code.
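In case it helps anyone do the same, a sketch assuming an authenticated GitHub CLI:

  # Mirror every repo the authenticated user has starred into the current directory.
  gh api user/starred --paginate --jq '.[].clone_url' |
  while read -r url; do
    git clone --mirror "$url"
  done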

I am terrified that in the very near future a lot of this content will either become paywalled, or the cost of hosting large information repositories will climb past what current ad-revenue-based models can support, as ever larger and more powerful scraping operations seek to fill their petabytes while I try to keep the few small TB of content I don't want to lose from slipping through my fingers.

If anyone actually cares deeply about content preservation, go and buy yourself a few 10+ TB external disks, grab a copy of zimmit, and start pulling stuff. Put it on archive.org and tag it. So far the only ZIM files I see on archive.org are the ones publicly released by the Kiwix team, yet there is an entire wiki of wikis called WikiIndex that remains almost completely unscraped. Fandom and Wikia are gigantic repositories of information, and I fear they will close themselves up sooner rather than later, while many of the smaller info stores we have all come to take for granted as being "at our fingertips" will slowly slip away.

I first noticed the deep web deepening when things I used to be able to find on Google no longer showed up, no matter how well I knew the content I was searching for, no matter the complex dorking I attempted using operators in the search bar; it was as if they had vanished. For a time Bing was excellent at finding these "scrubbed" sites. Then DuckDuckGo entered the chat, and Bing started to close itself down more. Bing was essentially a scrape of Google, and Google stopped being reliable, so downstream "search indexers" just became micro-Googles that were slightly out of date with slightly worse search accuracy, and those ghost pages were now being "anti-propagated" into these downstream indexers.

Yandex became and is still my preferred search engine when I actually need to find something online, especially when using operators to narrow wide pools.

I have found some rough edges with zimmit, and I am planning to investigate and even submit some PRs upstream. But when an archive attempt takes 3 days to run before crashing and wiping out its progress, it has been hard to debug without the FOMO hitting: I feel I should spend the time grabbing what I can now, before coming back to work on the code and getting everything done properly.

If anyone has the time to commit to the project and help make it more stable, perhaps working on more fault recovery or failure continuation, it would make archivists like me who are strapped for time very, very happy.

Please go and make a dent in this; news is not the only part of the web I feel could be lost forever if we do not act to preserve it.

In 5 years' time I see generic web search being considered legacy software and eventually decommissioned in favor of AI-native conversational search (blow my brains out). I know for a fact all AI companies are doing massive data collection and structuring for GraphRAG-style operations; my fear is that once it works well enough, search will just vanish until a group of hobbyists makes it available to us again.


[flagged]


No discussion == no dupe.




