
Ever since the NYT legal case against OpenAI (pronounce: ClosedASI, not FossAGI; free as in your data for them, not free as in beer), there has been an undercurrent on the web pulling into a riptide of closed information access. Humorously enough, the zimit project has been quietly updating the living heck out of itself, awakening from a nearly six-to-eight-year slumber. The once-simple tool for making offline MediaWiki archives can now mirror any website, complete with content such as video, PDFs, and other files.

It feels a lot like the end of Usenet or GeoCities, but this time without the incentive for archivists to share their collections as openly. I am certain full scrapes of Reddit and Twitter exist, even after the API closures, but we will likely never see them leave the internal data holdings of large AI companies.

I have taken it upon myself to begin using the updated zimit Docker container to archive swaths of the 'useful web', meaning not just high-quality language tokens, but high-quality citations and knowledge built on sources that are not just links to other places online.

I started saving all my starred GitHub repos into a folder, and it came out to just around 125 GB of code.
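For anyone wanting to do the same, here is a minimal sketch. The paginated starred-repos endpoint is a real GitHub API, but treat the details (field names, rate limits, auth for private data) as assumptions to verify against the GitHub REST API docs:

```python
import json
import os
import subprocess
import urllib.request

def api_url(user, page, per_page=100):
    """GitHub's paginated starred-repos endpoint for a public user."""
    return (f"https://api.github.com/users/{user}/starred"
            f"?per_page={per_page}&page={page}")

def starred_clone_urls(user):
    """Yield the clone URL of every repo the user has starred."""
    page = 1
    while True:
        with urllib.request.urlopen(api_url(user, page)) as resp:
            repos = json.load(resp)
        if not repos:  # an empty page means we are past the last one
            return
        for repo in repos:
            yield repo["clone_url"]
        page += 1

def mirror_all(user, dest="starred"):
    """Bare --mirror clones keep every ref, which suits archival."""
    os.makedirs(dest, exist_ok=True)
    for clone_url in starred_clone_urls(user):
        subprocess.run(["git", "clone", "--mirror", clone_url], cwd=dest)
```

Using `--mirror` rather than a plain clone also grabs tags and remote branches, which matters if the upstream repo later disappears.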

I am terrified that in the very near future a lot of this content will become paywalled, or that the cost of hosting large information repositories will outgrow today's ad-revenue-based models as larger, more powerful scraping operations fill their petabytes, while I try to keep my few small TB of content I don't want to lose from slipping through my fingers.

If anyone actually cares deeply about content preservation, go buy yourself a few 10+ TB external disks, grab a copy of zimit, and start pulling stuff. Put it on archive.org and tag it. So far the only ZIM files I see on archive.org are the ones publicly released by the Kiwix team, yet there is an entire wiki of wikis called WikiIndex that remains almost completely unscraped. Fandom (formerly Wikia) is a gigantic repository of information, and I fear it will close itself up sooner rather than later, while many of the smaller info stores we have all come to take for granted as being "at our fingertips" slowly slip away.
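As a sketch of the "put it on archive.org and tag it" step, assuming the real `internetarchive` Python package is installed and configured (`pip install internetarchive`, then `ia configure`); the identifier and metadata values below are illustrative assumptions, not canonical conventions:

```python
def zim_metadata(title, description, subjects):
    """Build an archive.org metadata dict; multiple subjects
    ("tags") are joined with ";" as archive.org expects."""
    return {
        "title": title,
        "mediatype": "data",
        "description": description,
        "subject": ";".join(subjects),
    }

def upload_zim(identifier, zim_path, metadata):
    # Deferred import: internetarchive is a real third-party
    # library whose upload() takes an item identifier, a list
    # of files, and a metadata dict.
    from internetarchive import upload
    upload(identifier, files=[zim_path], metadata=metadata)

# Example call (identifier and filename are made up):
# upload_zim("example-wiki-zim-2024",
#            "example-wiki_en_all.zim",
#            zim_metadata("Example Wiki (ZIM)",
#                         "Full zimit mirror of the example wiki",
#                         ["zim", "wiki", "archiving"]))
```

Consistent tags (the `subject` field) are what make these uploads findable later, which is the whole point of the exercise.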

I first noticed the deep web deepening when things I used to be able to find on Google stopped showing up, no matter how well I knew the content I was searching for, no matter the complex dorking I attempted with operators in the search bar; it was as if the pages had vanished. For a time Bing was excellent at finding these "scrubbed" sites. Then DuckDuckGo entered the chat, and Bing started to close itself down more. Bing was just a scrape of Google, and once Google stopped being reliable, downstream "search indexers" became micro-Googles that were slightly out of date with slightly worse search accuracy, and those ghost pages were now being "anti-propagated" into the downstream indexers.

Yandex became, and remains, my preferred search engine when I actually need to find something online, especially when using operators to narrow wide pools.

I have found some rough edges with zimit, and I plan to investigate and even submit some PRs upstream. But when an archive attempt runs for three days before crashing and wiping out its progress, it is hard to debug without the FOMO hitting: I should spend the time grabbing what I can now, and come back later to work on the code and get everything right.

If anyone has the time to commit to the project and help make it more stable, perhaps by working on fault recovery or failure continuation, it would make archivists like me, who are strapped for time, very, very happy.

Please go and make a dent in this: news is not the only part of the web I feel could be lost forever if we do not act to preserve it.

In five years I can see generic web search being considered legacy software and eventually decommissioned in favor of AI-native conversational search (blow my brains out). I know for a fact that the AI companies are all doing massive data collection and structuring for GraphRAG-style operations; my fear is that once it works well enough, search will simply vanish until a group of hobbyists makes it available to us again.



