Dweb: Offline Internet Archive (github.com/internetarchive)
94 points by sleepchaser on Jan 9, 2023 | 22 comments



Archiving data from the Internet Archive is useful, but I would argue there's a greater need going unfulfilled: local, direct site archiving. Twenty years ago you could punch a URL into HTTrack and get a local site copy that was indistinguishable from the live version. As far as I know there's nothing like that available today. The closest methods I know of involve stringing together four or five entirely separate utilities to get one saved site.


https://github.com/webrecorder/browsertrix-crawler works pretty well. Same tech as https://archiveweb.page/ but non-interactive


Cloudflare has more or less killed white-hat crawling at large. On top of that, you have sites like Bloomberg where even solutions like Puppeteer-stealth will fail at some point.


What about Selenium with clever clicking? I never got why crawlers never took the approach of just being a slow human clicking around Chrome.
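
For what it's worth, a rough sketch of that approach with Selenium (purely illustrative; the start URL, link limit, and delays are made up):

    import random
    import time
    from urllib.parse import urlparse

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    START = "https://example.com/"  # hypothetical starting point
    driver = webdriver.Chrome()

    def same_site_links():
        # Collect links on the current page that stay on the same host.
        host = urlparse(driver.current_url).netloc
        return {
            a.get_attribute("href")
            for a in driver.find_elements(By.TAG_NAME, "a")
            if a.get_attribute("href") and urlparse(a.get_attribute("href")).netloc == host
        }

    driver.get(START)
    for url in list(same_site_links())[:20]:
        time.sleep(random.uniform(2, 8))  # pause like a person reading
        driver.get(url)
        # ...save driver.page_source to disk here...

    driver.quit()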


Anyone looking into this, I'd recommend python-playwright. It has good browser engines, and the ability to hot-patch every request means you don't need to write a separate proxy to intercept requests.

Writing something like this would be a pretty big project, at least a few solid weeks of work for someone who knew what they were doing, and there are a lot of edge cases.
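
For anyone curious, here's a minimal sketch of that kind of in-process interception with Playwright (the target URL and filtering rules are made up, not taken from any real crawler):

    from playwright.sync_api import sync_playwright

    def handle_route(route):
        # Every request can be inspected, rewritten, or blocked here,
        # with no separate proxy process.
        if route.request.resource_type in ("image", "media", "font"):
            route.abort()      # drop heavy assets we don't care about
        else:
            route.continue_()  # pass everything else through unchanged

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", handle_route)                       # hook every outgoing request
        page.on("response", lambda r: print(r.status, r.url))  # observe what comes back
        page.goto("https://example.com")                       # hypothetical target
        browser.close()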


I run a crawler and I dunno if I agree with this description.

Usually anti-bot measures are in place for a reason, either to protect some paywalled content from scraping, or to prevent bots from getting at forms and causing a mess.

Few webmasters have Cloudflare (or similar) anti-bot measures turned up higher than they need to be (because it 100% will annoy people), and content that isn't a high-value target for black-hat bot activity is rarely difficult to crawl.


Well, wget still works and allows you to mirror a website with a single command:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://site-to-download.com

So I'm not sure what you mean by saying there are no tools available to mirror a website.



If I'm reading this right, ArchiveBox is intended for archiving individual web pages, not entire sites.


Not for archiving, but this was how I learned HTML/CSS to build websites in 1999-2000.


SingleFile [0] and SingleFileZ [1] are my tools of choice for archiving pages. I often use an adjusted version of the aardvark-bookmarklet to cut out unwanted parts.

[0] https://chrome.google.com/webstore/detail/singlefile/mpiodij...

[1] https://chrome.google.com/webstore/detail/singlefilez/offkdf...


On iOS/macOS, DevonThink includes a web scraper with full-text local search.


This piqued my interest until I saw the $99 price.


iOS version is $3/mo or $50 for non-subscription. Cross-device sync via WebDAV, Dropbox or iCloud. Pricy, but they are an indie dev team who have been around for ~20 years, provide email support by competent humans, continuously add features, and most importantly, have been reliable with large databases (e.g. 75GB) on iOS. Web scraper can optionally convert to plaintext, markdown or PDF.


I just used HTTrack last week; it still works fine.


This is a nice project, but I scrolled this site in the hope of finding a download link to the "offline content". I hope this project's goals include creating it.

I wonder how big that offline DB will be. PB range?

Also, having potentially hundreds of servers trying to download the entire Internet Archive again and again is a recipe for destroying the very resource we want to preserve. The Internet Archive already runs like crap most of the time. Imagine how bad it'll get if everyone starts leeching their whole DB.

So in the absence of offline archive data, this project should have the following features:

- P2P downloading (based on BitTorrent, perhaps) to avoid re-downloading from the source what can be fetched from online peers.

- The ability to mirror live sites and present them (dated) via the P2P network where possible.

- The ability to buy an HDD with a physical copy of the data.


>I wonder how big that offline DB will be. PB range?

In theory, ~99 PB.

https://archive.org/donate/

>>We promise to put your donation to good use as we continue to store over 99 petabytes of data, including 625 billion webpages, 38 million books and texts, and 14 million audio recordings.


NodeJS is a prerequisite?


Heh, yeah I clicked back as soon as I saw that too.

It's an easy quality measure for me.


> It's an easy quality measure for me.

Agreed.

It's also got YAML!

At least it includes a Dockerfile with all the dependencies (I assume).


Can you explain how node.js indicates (I presume) low quality?


I'm hyper-aware that one can write spaghetti in any language, but Node and its ecosystem just seem to embrace it; let's just consider the entry point, for example: https://github.com/internetarchive/dweb-mirror/blob/0.2.90/i...



