Hacker News new | past | comments | ask | show | jobs | submit login

> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

Is it really? I remember hacking around with with JavaScript's XMLSerializer (I think) like 5 years ago and solved that for ~90% of the websites I tried to archive. It'd save the DOM as-is when executed.

Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.




90% feels like an overestimate to me but it's already quite poor, you wouldn't accept that for saving most other things. Another problem is highlighted in the piece - it's a hassle to ensure external tools handle session state and credentials. Dynamic content is poorly handled, the default behaviours are miserable (a browser will run random Javascript from the network but not Javascript you've saved, etc).

There's a lot of interest in 'digital preservation' and perhaps one sign of how it's very much early days of the field - it's tricky to 'just save' the results of one of the most basic current computer interactions - looking at a web page.


But if you serialize the DOM as-is, you literally get what you see on the page when you archive it. Nothing about it is dynamic, and there is no sessions nor credentials to handle. Granted, it's a static copy of a specific single page.

If you need more than that, then WARC is probably the best. For my measly needs of just preserving exactly what I see, serializing the DOM and saving the result seems to do just fine.


Yes you save something that's mildly better than print-page-to-PDF. But it still misses things and the interactive stuff is very much part of 'exactly what I see'. Like, a random article with an interactive graph, for instance - like this recent HN hit https://ciechanow.ski/airfoil/

It's not that there aren't workarounds, it's that they are clunky and 'you can't actually save the most common computery entity you deal with' is just a strange state of affairs we've somehow Stockholmed ourselves to.


> Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.

One category that the archivers do poorly with is news articles where a pop-up renders on page load which then requires client-side JS execution to dismiss the pop-up.

Sometimes it is easily circumvented by manual DOM manipulation, but that's hardly a bulletproof solution. And it feels automateable.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: