Archiving data from the Internet Archive is useful, but I would argue there's a greater need going unfulfilled: local, direct site archiving. Twenty years ago you could punch a URL into HTTrack and get a local copy of the site that was indistinguishable from the live version. As far as I know there's nothing like that available today. The closest thing I know of involves stringing together four or five entirely separate utilities just to save one site.
Cloudflare has more or less killed white-hat crawling at large. On top of that, you have sites like Bloomberg where even solutions like Puppeteer-stealth will fail at some point.
Anyone looking into this, I'd recommend python-playwright. Good browser engines plus the ability to hot-patch every request mean you don't need to write a separate proxy to intercept requests.
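For anyone curious what that looks like in practice, here's a minimal sketch using Playwright's sync API to intercept every request and keep a copy of the response body; the output directory and the example.com URL are placeholders of mine, not anything from the comment above:

    from pathlib import Path
    from urllib.parse import urlparse
    from playwright.sync_api import sync_playwright

    OUT = Path("mirror")  # hypothetical output directory

    def save_and_continue(route):
        # Fetch the request ourselves, stash the body on disk, then hand the
        # original response back to the page so rendering is unaffected.
        response = route.fetch()
        name = urlparse(route.request.url).path.strip("/") or "index.html"
        target = OUT / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.body())
        route.fulfill(response=response)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", save_and_continue)  # hook every request the page makes
        page.goto("https://example.com", wait_until="networkidle")
        browser.close()

Fulfilling with the original response keeps the page rendering normally while a copy lands on disk, which is exactly the part that used to require running a separate intercepting proxy.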
Writing something like this would be a pretty big project, at least a few solid weeks of work for someone who knew what they were doing, and there are a lot of edge cases.
I run a crawler and I dunno if I agree with this description.
Usually anti-bot measures are in place for a reason, either to protect some paywalled content from scraping, or to prevent bots from getting at forms and causing a mess.
Few webmasters have Cloudflare (or similar) anti-bot measures turned up higher than they need to be (because doing so will absolutely annoy people), and content that isn't a high-value target for black-hat bot activity is rarely difficult to crawl.
SingleFile [0] and SingleFileZ [1] are my tools of choice for archiving pages. I often use an adjusted version of the aardvark bookmarklet to cut out unwanted parts first.
The iOS version is $3/mo or $50 for a one-time, non-subscription purchase. Cross-device sync via WebDAV, Dropbox or iCloud. Pricey, but they are an indie dev team who have been around for ~20 years, provide email support by competent humans, continuously add features, and most importantly, have been reliable with large databases (e.g. 75GB) on iOS. The web scraper can optionally convert pages to plaintext, Markdown or PDF.
This is a nice project, but I scrolled through the site hoping to find a download link for the "offline content". I hope this project's goals include creating it.
I wonder how big that offline DB will be. PB range?
Also, having potentially hundreds of servers trying to download the entire Internet Archive again and again is a recipe for destroying the very resource we want to preserve. The Internet Archive already runs like crap most of the time. Imagine how bad it'll get if everyone starts leeching their whole DB.
So in the absence of offline archive data, this project should have the following features:
- P2P downloading (based on BitTorrent, perhaps) to avoid re-downloading from the source what can be fetched from online peers (see the sketch after this list).
- Ability to mirror live sites and present them (dated) via the P2P network where possible.
- Ability to buy an HDD with a physical copy of the data.
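To make the BitTorrent idea a bit more concrete, here's a minimal sketch of the peer-fetching half using the python-libtorrent bindings; the magnet link is a made-up placeholder and none of this reflects anything the project actually provides:

    import time
    import libtorrent as lt

    # Hypothetical magnet link for one archived bundle; the idea is that the
    # project would publish these so peers seed to each other instead of
    # everyone re-downloading from the Internet Archive itself.
    MAGNET = "magnet:?xt=urn:btih:..."

    ses = lt.session()
    params = lt.parse_magnet_uri(MAGNET)
    params.save_path = "./archive"
    handle = ses.add_torrent(params)

    while not handle.status().is_seeding:
        s = handle.status()
        print(f"{s.progress * 100:.1f}% done, {s.num_peers} peers")
        time.sleep(5)

Once a node finishes downloading it keeps seeding, so the load on the original source drops as more people mirror the data.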
>>We promise to put your donation to good use as we continue to store over 99 petabytes of data, including 625 billion webpages, 38 million books and texts, and 14 million audio recordings.