Archiving web sites

fake-name · on Nov 23, 2018

Whee, a topic close to my heart!

https://github.com/fake-name/ReadableWebProxy is a project of mine that started out as a simple rewriting proxy, but at this point is basically a self-contained archival system for entire websites, complete with preserved historical versions of scraped content. It has a distributed fetching frontend[1], uses chromium[2] to optionally deal with internet breaking bullshit (Helloooo cloudflare! Fuuuuucccckkkkk yyyyooouuuuuuu), supports multiple archival modes (raw, e.g. not rewritten and destyled, and a rewritten format which makes reading internet text content actually nice), and a bunch of other stuff. The links in fetched content are rewritten to point within the archiver, and if content is not already retreived, it's fetched on-the-fly as you browse.

It also has plugin-based content rewriting features, allowing the complete reformatting of content on-the-fly, and functions as a backend to a bunch of other projects (I run a translated light-novel/web-novel tracker site, and it also does the RSS parsing for that).

I've been occatonally meaning to add WARC forwarding to the frontend, and feed that into the internet archive, but the fetching frontend is old, creaky and brittle (it's some old code), and does a lot of esoteric stuff that would be hard to replicate.

[1]: https://github.com/fake-name/AutoTriever [2]: https://github.com/fake-name/ChromeController

londons_explore · on Nov 23, 2018

Things that attempt to rewrite links and inline css and javascript are doomed to fail. Many sites do wierd javascript shenanigans, and without a million special cases, you'll never make it work reliably. Just try archiving your facebook news feed and let me know how it goes.

Instead, archivists should try to record the exact data sent between the server and a real browser, and then save that in a cache. Then, when viewing the archive, use the same browser and replay the same data, and you should see the exact same thing! With small tweaks to make everything deterministic (disallow true randomness in javascript, set the date and time back to the archiving date so SSL certs are still valid), this method can never 'bit rot'.

When technology moves on, and you can no longer run the browser and proxy, you wrap it all up in a virtual machine, and run it like that. Virtual machines have successfully preserved games consoles data nearly perfectly for ~40 years now, which is far better than pretty much any other approach.

anarcat · on Nov 23, 2018

so the article does go in details about how just "wget-ing" a website isn't sufficient. this is what WARC files are built for, and that's why i insisted on that principle.

but while it's true that some sites require some serious javascript stuff to be archived properly, my feeling is that if you design your site to be uncrawlable, you are going to have other problems to deal with anyways. there will be accessibility problems, and those affect not only "people with disabilities" but also search engines, mobile users, voice interfaces, etc.

if you design your site for failure, it will fail and disappear from history. after all, it's not always the archivist's fault sites die - sometimes the authors should be blamed and held to a higher standard than "look ma, i just discovered Javascript and made a single page app, isn't that great?" :p

megous · on Nov 23, 2018

Communication is dependent on JS events, which are dependent on user's action. There's also localStorage and other such things. Your method might work for some simple JS based websites, but it's no silver bullet.

bicubic · on Nov 23, 2018

While what you say is true, the above method is the only method to archive arbitrary web pages. Yes it depends on user interactions to some extent, but it's possible to reasonably let a page load until fetches stop, and consider it rendered. Generally speaking, you can only archive some preset interactions with a modern web page. You can't hope to capture it all.

There are tools like WebRecorder[0] that do this to some extent by recording and replaying all requests. It's certainly a step in the right direction and demonstrates that the approach is viable. This was the only approach I tried that worked for archiving stuff like three.js demos. Worth mentioning there's also an Awesome list[1] that covers various web archival technologies.

[0] https://github.com/webrecorder/webrecorder

[1] https://github.com/iipc/awesome-web-archiving

nikisweeting · on Nov 24, 2018

What I'm trying to do with Bookmark Archiver is record user activity to create a puppeteer script while the user is visiting the page, then replay that on a headless browser later and record the session to a WARC. That should cover both dynamically requested and interactive stuff that is otherwise being missed by current archiving methods.

I also plan on saving an x86 VM image of the browser used to make the archive every couple months so that sites can be replayed many decades into the future.

AlphaWeaver · on Nov 23, 2018

In the realm of scraping and page archiving, I'd like to note a library I found useful recently, called `freeze-dry` [0][1]. It packages a page into a SINGLE HTML file, inlining relevant styles. The objective is to try and replicate the exact look and structure of the page instead of all the interactive elements. Very useful for building a training dataset for any algorithms that read web pages.

[0]: https://www.npmjs.com/package/freeze-dry

[1]: https://github.com/WebMemex/freeze-dry

kevingrahl · on Nov 22, 2018

I’ve been using

    wget -E -H -k -K -nd -N -p -P pageslug URL

for some time now and never had any issues with it. I created a .bash_aliases entry so that now I only have to type

    war pageslug URL

to archive some website.

I haven’t archived to many websites (I focus more on media files like videos, ebooks and such) that’s probably why I haven’t run into any issues yet but I’d be interested if somebody has a link that doesn’t work with this method just so that I can see what the result would be like.

Here’s an explanation of the method I use for anyone interested: https://gist.github.com/dannguyen/03a10e850656577cfb57

ivarv · on Nov 22, 2018

Archiving one's web browsing trail seems to be a common use case. Here are some promising related projects that have been on HN:

* https://github.com/pirate/bookmark-archiver

* https://getpolarized.io/

danso · on Nov 22, 2018

I've got a set of small repos of small government sites I've used a combination of `wget` and `curl` and other shell commands to snapshot, mostly so I can have a reliable mirror when teaching web scraping: https://github.com/wgetsnaps

But as the submitted article points out, archiving the Web is much trickier these days, and wget is no longer sufficient for anything relatively modern. I've been impressed with what Internet Archive has seemingly been able to do, and I've been interested whether it's the result of improved techniques on their side, or of certain sites following a standard that happens to make them more archivable.

For example, 538's 2018 election trackers are very JS dependent, yet IA has managed to capture them in a way that not only preserves the look and content, but keeps their widgets and visualizations almost entirely fully functional:

https://web.archive.org/web/20181102125134/https://projects....

However, even the excellent archive of 538's site shows a huge weakness in IA's efforts: IA (quite understandably) aggressively caches a site's dependencies, such as external JS and JSON data files. If you scroll down the 538 example posted above, you'll see that despite being a snapshot on Nov. 2, 2018, many of its widgets only contain data from the last time IA fetched its external dependencies, which appears to be August 16, 2018.

unicornporn · on Nov 22, 2018

This post doesn't mention the Webrecorder Player[1], which is a GUI app that displays WARC files. It's probably the easiest way to view Web ARChives.

For those willing to set up a docker container, check out Warcworker[2].

[1] https://github.com/webrecorder/webrecorder-player

[2] https://github.com/peterk/warcworker

anarcat · on Nov 23, 2018

I have dietary restrictions against Electron apps which is why I didn't mention Webrecorder player. Besides, last time I suggested anything touching node.js to LWN readers, the answer was a clear "nope" so I tend to thread carefully there as well. ;)

dewey · on Nov 22, 2018

They link to the webrecorder repositories in the second paragraph though, including https://github.com/webrecorder/pywb

batuhanicoz · on Nov 22, 2018

Disclaimer: My company works with Teyit and I've built the archiving product. Also: This is a shameless plug.

Teyit.org[0], the biggest fact-checking organization in Turkey, has their own archiving site called teyit.link[1].

It's a non-profit organization and they automatically archive any link that's sent to them via their site, Twitter, Facebook etc.. It's also usable by the public.

It's open source on GitHub[2] and we've actually been developing a new version[3] and have a plan to add `youtube-dl` along with WARC.

[0] https://teyit.org

[1] https://www.teyit.link

[2] https://github.com/teyit/teyitlink-web

[3] https://github.com/noddigital/teyit.link

patrickyeon · on Nov 22, 2018

gwern has a very involved post on archiving as well, https://www.gwern.net/Archiving-URLs

Somewhere on my to-do list is archiving everything I visit on the internet. It's frustrating to know that I've seen something, but be unable to find it again.

gildas · on Nov 22, 2018

For this, you could use SingleFile [1], an extension for Chrome and Firefox. It can auto-save pages.

[1] https://github.com/gildas-lormeau/SingleFile

pflanze · on Nov 22, 2018

I should probably try this. I wonder how it will compare with what I'm doing:

I simply use the browser's (Firefox) save page feature (ctl-s). Then every now and then, I convert the folder with these pages to a squashfs image (which de-duplicates all the CSS, JS, image files that are saved multiple times). I then use shell tools to search (ls grep locate etc.). This doesn't save the URL, but I also maintain a private "bookmarks" Git repository for more interesting bits where each bookmarked resource gets its own file (in a descriptive hierarchy), along with thoughts and notes.

What works well is that I am somewhat selective, complete trash doesn't end up being archived, it's pretty space efficient. It's also simple to wade through (in the shell), each saving action is just a file and folder pair. Also often I use the reader mode Firefox feature, and then save that (i.e. I save what I saw, not what was delivered). What doesn't work so well is that saving the page is often a bit of a hassle, often Firefox reports save errors then I just ctl-s again to see if the previous attempt succeeded, if not, hit the reload icon in the download task list which apparently forces it. Also, that it saves embedded videos, haven't figured out yet how to disable that, so periodically I remove videos again. Also, when saving multiple pages from the same site, links to other pages don't go to the local mirror. (In cases where it's important, I use wget -m -k.)

gildas · on Nov 23, 2018

I think the differences would be the following ones:

- You wouldn't have to bother with JS files, they are removed from saved pages by default.

- You couldn't (easily) de-duplicate resources because they are embedded in base64 into the page. However, SingleFile can detect all the hidden elements and unused CSS rules/declarations (by computing the cascade). So the HTML and CSS are optimized. It is also able to group duplicate images (by using CSS custom properties). There are other options to make the document size as small as possible. Most of the time, pages saved with SingleFile are smaller than Chrome MHTML files.

- The saving process would be much more simple, reliable and could be automatic.

- Videos wouldn't be saved (by default), a snapshot of the video would replace each video.

- Maybe the "filename template" in SingleFile would help you to organize things.

victor106 · on Nov 23, 2018

SingleFile does not seem to embed images in the file. I tried this[1] and it seems to work fine. It seems to be closed source though..:(

[1]https://chrome.google.com/webstore/detail/save-as-mht/hfmodl...

gildas · on Nov 23, 2018

It does embed images (and all other resources). Can you post a URL showing this issue please?

victor106 · on Nov 23, 2018

I used it on an internal website and when I opened it, it didn't have the image in it. In place of images I only see blank place holders. But it did preserve the structure of the website.

gildas · on Nov 23, 2018

If you see JS errors related to the extension or HTTP errors in the console of the developer tools, please file an issue with the errors on GitHub. It would help understanding what's going wrong.

EDIT: it could be related to the fact that SingleFile is maybe more strict regarding the HTTP header "Content-Type". For example, it will discard images with "text/html" as content type value.

victor106 · on Nov 23, 2018

Tried it again on a few other pages and it works most of the time. I am not sure why it fails on some. I will try to debug this and create a pr

gildas · on Nov 23, 2018

If you can find a public URL showing this issue, don't bother with debugging it. Just post the URL on GitHub.

vackosar · on Nov 22, 2018

tool for domain to epub conversio https://github.com/haroldtreen/epub-press-clients/blob/maste...

vackosar · on Nov 22, 2018

unfortunately it requires a lot of permissions. I try to minimize exts like this

gildas · on Nov 22, 2018

I really did my best to minimize the APIs used by the extension. Note that Chrome 70 allows you to restrict extensions by host [1].

[1] https://blog.chromium.org/2018/10/trustworthy-chrome-extensi...

toomuchtodo · on Nov 22, 2018

I use grab site for extensive archiving operations. It’s not an extension, but trivial to launch from a command line (and I stuff the data into a Backblaze B2 account for later Internet Archive ingestion).

You could take this and post process your internet history on a rolling basis to accomplish your goal.

https://github.com/ludios/grab-site

anarcat · on Nov 23, 2018

I toyed with that idea a few times, but this now strikes me as a problem for two reasons:

1. it's a security liability. some content i load in a web browser is private and I don't want to archive or duplicate it anywhere

2. it means a lot of crap. part of archiving content is curating what gets archived and what doesn't. i didn't touch on this in the article, but it's a key idea archivists need to address. for example, archiving the page I'm typing in right now (news.y.com/reply) makes no sense at all because it's solely dynamic content and would mean nothing when browsed later.

So instead I send specific links I want to keep in my bookmarks system, as I mentioned in the article. It's far from ideal, but it's a much better compromised than archiving every time I visit my weather service page. ;)

namibj · on Nov 23, 2018

You are not alone. It's further back on said to-do list, as it appears to be a problem with no easy existing solution, and it's too big of a project for the immediate payoff.

If you happen to stumble upon a solution (obviously self-hosted/local, due to the unfiltered access to page content), I might be willing to contribute with configuration scripting or client/server splitting so that the bulk doesn't have to stay on e.g. a laptop.

burtonator · on Nov 23, 2018

I think you guys might like this personal web archival tool I launched about a month ago:

https://getpolarized.io/

It's basically an offline browser where you can capture full HTML pages locally including the iframes, and tag and annotate the content.

I should have cloud sync support in the next release (1-2 weeks) which will allow you to keep your data in the cloud and sync it between machines. Initially it will just support Firebase but I have plans to support other cloud providers via plugins.

I'd also like to support end to end encryption so that you don't have to worry about people reading your data.

There's a huge Hacker News about Polar here:

https://news.ycombinator.com/item?id=18219960

A semi-requested feature is full recursive archival of content but I don't think we're going to go in that direction. Instead I think we're going to support pasting or importing a list of URLs.

Many documentation sites have an 'index' of Table of Contents and this way I can just fetch and store all those URLs without over-fetching.

My background is search and I built a petabyte scale search service named Datastreamer (http://www.datastreamer.io/). I'm also one of the inventors of RSS - so I have a lot of ideas on the roadmap here.

It also supports PDFs, text and area highlights, comments, flashcards and sync with Anki.

The initial response after our release has been amazing. The user base is really engaged with thousands of monthly active users and contributors.

Anyway. Take it for a spin. It's free and Open Source.

frontier · on Nov 23, 2018

I recently had to do this and after a lot of frustration with wget, httrack and some other commercial ones too, I ended up settling on the results of this free product, WebCopy.

https://www.cyotek.com/cyotek-webcopy

Background: We couldn't keep the existing platform running, so had to transition to static html files.

I used the WebCopy scan log to create the apache rewrite rules to preserve the existing link structure.

Where I say WebCopy was better, it was this simple log, but also the file structure it was producing was much cleaner with less junk pages and duplicates. (the site was an absolute inconsistent mess to begin with)

frontier · on Nov 23, 2018

I was surprised to not find any product that could create a perfect static clone of the original, as far as maintaining the incoming link structure.

I know there would be a tonne of edge cases and obviously it would need to be targeted to a particular platform, but I think we came pretty close with this simple technique.

jnurmine · on Nov 23, 2018

I am of the opinion that anything you might need to read two-three times over a longer period of time should be copied locally, or to some service, for later retrieval.

I use Pocket a lot, and "share" into it from different devices.

But sometimes just do "wget --mirror -np -k --limit-rate=10k https://interesting.stuff.here.org/" on a PC.

I'd actually like a self-hosted variant of Pocket, but have not really researched if those even exist. Anyone with a suggestion?

nikisweeting · on Nov 23, 2018

Thanks for the mention of bookmark-archiver! WARC support has been high on my list for a long time, but unfortunately, I have a day job that keeps me super busy.

Also, the author, Antoine Beaupré is also an engineer living in Montreal who works on mesh networking stuff, are we the same person?! I just sent him an email to make sure it doesn't land in my own inbox...

anarcat · on Nov 23, 2018

Hi Nick! Yes, I am the same person as ... er... myself if that makes any sense at all. :) I'll followup by email, thanks for reaching out!

anarcat · on Nov 23, 2018

author here. AMA.

since I wrote this, i started experimenting with grab-site:

https://github.com/ludios/grab-site

it's a wrapper around (and a fork of) wpull but the main advantage over it is it can do on-the-fly reconfiguration of delay, concurrency, ignore patterns and so on. it also provides a nice web interface. if you're only crawling one site every once in a while, wpull and crawl are fine, but for larger projects, grab-site is a must.

shakna · on Nov 23, 2018

I was working on an archiving tool a little while back, though I haven't touched it recently.

It would recursively convert a page into a single URI. Chrome seems to have a limit for URLs, but Firefox doesn't, so far as I can tell.

Copy the contents of [0] into your URL bar, and you'll see not just the page, but the Python script is embedded in it too. (It's a bit long to dump onto a forum page).

[0] https://shakna.keybase.pub/offlineweb

ernsheong · on Nov 23, 2018

Those desiring something simple can check out https://www.pagedash.com/. Note: requires login and saves pages to the cloud.

(Disclaimer: I am the maker of PageDash)

kev009 · on Nov 22, 2018

I use a wget invocation similar to the one listed for ps-2.kev009.com. I've recently used HTTRack in a few places where that had issues and was impressed with it as well.

alwaysreading · on Nov 22, 2018

Love wget! The waybackmachine is a great tool but I wish there was a more robust/complete service out there. Maybe the government is/will archive the top million sites or something like that.