Archiving data from the Internet Archive is useful, but I would argue there's a greater need going unfulfilled: local, direct site archiving. Twenty years ago you could punch a URL into HTTrack and get a local copy of the site that was indistinguishable from the live version. As far as I know there's nothing like that available today. The closest thing I know of involves stringing together four or five entirely separate utilities just to save one site.
Cloudflare has more or less killed white-hat crawling at large. On top of that, you have sites like Bloomberg where even solutions like Puppeteer-stealth will fail at some point.
Anyone looking into this, I'd recommend python-playwright. Good browser engines plus the ability to hot-patch every request mean you don't need to write a separate proxy to intercept requests.
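For anyone curious what that looks like in practice, here's a minimal sketch using Playwright's sync API to intercept every request and keep a copy of the response body; the output directory and the example.com URL are placeholders of mine, not anything from the comment above:

    from pathlib import Path
    from urllib.parse import urlparse
    from playwright.sync_api import sync_playwright

    OUT = Path("mirror")  # hypothetical output directory

    def save_and_continue(route):
        # Fetch the request ourselves, stash the body on disk, then hand the
        # original response back to the page so rendering is unaffected.
        response = route.fetch()
        name = urlparse(route.request.url).path.strip("/") or "index.html"
        target = OUT / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.body())
        route.fulfill(response=response)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", save_and_continue)  # hook every request the page makes
        page.goto("https://example.com", wait_until="networkidle")
        browser.close()

Fulfilling with the original response keeps the page rendering normally while a copy lands on disk, which is exactly the part that used to require running a separate intercepting proxy.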
Writing something like this would be a pretty big project, at least a few solid weeks of work for someone who knew what they were doing, and there are a lot of edge cases.
I run a crawler and I dunno if I agree with this description.
Usually anti-bot measures are in place for a reason, either to protect some paywalled content from scraping, or to prevent bots from getting at forms and causing a mess.
Few webmasters have Cloudflare (or similar) anti-bot measures turned up higher than they need to be (because doing so will absolutely annoy people), and content that isn't a high-value target for black-hat bot activity is rarely difficult to crawl.
SingleFile [0] and SingleFileZ [1] are my tools of choice for archiving pages. I often use an adjusted version of the aardvark bookmarklet to cut out unwanted parts first.
The iOS version is $3/mo or $50 for a one-time, non-subscription purchase. Cross-device sync via WebDAV, Dropbox or iCloud. Pricey, but they are an indie dev team who have been around for ~20 years, provide email support by competent humans, continuously add features, and most importantly, have been reliable with large databases (e.g. 75GB) on iOS. The web scraper can optionally convert pages to plaintext, Markdown or PDF.
This is a nice project, but I scrolled through the site hoping to find a download link for the "offline content". I hope this project's goals include creating it.
I wonder how big that offline DB will be. PB range?
Also, having potentially hundreds of servers trying to download the entire Internet Archive again and again is a recipe for destroying the very resource we want to preserve. The Internet Archive already runs like crap most of the time. Imagine how bad it'll get if everyone starts leeching their whole DB.
So in the absence of offline archive data, this project should have the following features:
- P2P downloading (based on BitTorrent, perhaps) to avoid re-downloading from the source what can be fetched from online peers (see the sketch after this list).
- Ability to mirror live sites and present them (dated) via the P2P network where possible.
- Ability to buy an HDD with a physical copy of the data.
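To make the BitTorrent idea a bit more concrete, here's a minimal sketch of the peer-fetching half using the python-libtorrent bindings; the magnet link is a made-up placeholder and none of this reflects anything the project actually provides:

    import time
    import libtorrent as lt

    # Hypothetical magnet link for one archived bundle; the idea is that the
    # project would publish these so peers seed to each other instead of
    # everyone re-downloading from the Internet Archive itself.
    MAGNET = "magnet:?xt=urn:btih:..."

    ses = lt.session()
    params = lt.parse_magnet_uri(MAGNET)
    params.save_path = "./archive"
    handle = ses.add_torrent(params)

    while not handle.status().is_seeding:
        s = handle.status()
        print(f"{s.progress * 100:.1f}% done, {s.num_peers} peers")
        time.sleep(5)

Once a node finishes downloading it keeps seeding, so the load on the original source drops as more people mirror the data.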
>>We promise to put your donation to good use as we continue to store over 99 petabytes of data, including 625 billion webpages, 38 million books and texts, and 14 million audio recordings.