Show HN: Pocket Stream Archive – A personal Way-Back Machine (github.com/pirate)
217 points by nikisweeting on May 5, 2017 | 68 comments



I still use Firefox ScrapBook for this: https://addons.mozilla.org/en-US/firefox/addon/scrapbook/

Just for articles, mind you, not entire websites.


Ditto. I just hope the transition to WebExtensions will be smooth, or at least happens at all.


Screenshotting or PDFing a website is an increasingly important archiving technique to supplement wget. I've come across a lot of websites that won't render any content unless they're connected to a live server.


I couldn't agree more. I wish more sites would load without needing multiple seconds of JS execution and AJAX. One of my TODOs is to get full-page screenshots working as well.


> One of my TODOs is to get full-page screenshots working as well.

https://medium.com/@dschnr/using-headless-chrome-as-an-autom...
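
For reference, a minimal sketch of how that could be scripted (the flags are documented headless-Chrome options; the oversized window is just a stand-in, since a true full-page flag isn't exposed on the command line):

    # Hedged sketch: approximate a full-page screenshot by giving headless
    # Chrome a very tall viewport. The window size is illustrative.
    import subprocess

    subprocess.run([
        'google-chrome', '--headless', '--disable-gpu',
        '--screenshot',                # writes screenshot.png into the cwd
        '--window-size=1440,5000',     # tall viewport stands in for "full page"
        'https://example.com',
    ], check=True)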


You might find this useful if you use Firefox:

- shift+f2

- screenshot urcodiaz.png --fullpage


Thank you for this tip! I've always installed the Abduction plugin when I needed this, but the fewer plugins the better.


In some cases it leads to horrible experiences, like my LinkedIn feed. I have to wait a few seconds every time before the content loads.


PDFs are really not suitable for archiving websites since they're designed around pages and the web does not have pages.

A better option is to render a page with JS turned on and save the resulting HTML.


I was doing this before with the chrome --dump-dom flag but the output I was getting was garbage, and no more useful than the simpler wget download. PDF turned out to produce really nice, readable archives about 75% of the time, so I kept it in. Text-based sites tend to do a good job of having PDF-friendly styling.


I've been trying to do something like this in this project: https://github.com/ianb/pagearchive

It came out of a screenshot/archiver I've been working on at Mozilla, but I've split it up since the screenshotting part is shipping and DOM archiving is still way outside Mozilla's comfort zone.


I googled it, and --dump-dom simply dumps the output of document.body.innerHTML. I have not used headless chrome at all, but I imagine it would not be that hard to get it to dump document.documentElement.outerHTML instead.

(Executing JavaScript on the page is most likely possible?)
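
It is; a rough sketch (not the project's code) of doing it over the DevTools protocol from Python, assuming Chrome was launched with --headless --remote-debugging-port=9222 and the requests and websocket-client packages are installed:

    # Evaluate JS in the live page and read back the full markup, instead of
    # relying on --dump-dom. Purely illustrative.
    import json
    import requests
    import websocket

    tab = requests.get('http://localhost:9222/json').json()[0]    # first open tab
    ws = websocket.create_connection(tab['webSocketDebuggerUrl'])  # attach to it

    ws.send(json.dumps({
        'id': 1,
        'method': 'Runtime.evaluate',
        'params': {'expression': 'document.documentElement.outerHTML'},
    }))
    reply = json.loads(ws.recv())
    print(reply['result']['result']['value'])                      # <html>...</html>
    ws.close()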


PDFs have the advantage of being a fixed format that should display the same everywhere. It's probably less fragile than modern CSS or HTML.

However, as you point out, PDFs are designed around the printed page, not the flowing arbitrary-page-size documents of the web.


Wouldn't a copy of the DOM be even better than a screenshot?

I.e. DOM copy > screenshot > wget?


I mentioned it below, but I tried getting DOM snapshots using chrome --dump-dom, and the output usually didn't render well without a <head> section (Chrome only outputs the <body>). I could attach the <head> from the wget file... but then it starts getting messy and complicated.


Agreed. I research new media and archive.org is invaluable to me. I worry that current websites won't be able to be preserved, much like many of the Flash sites and RealAudio of the past are largely gone.


Speaking of audio, one of my highest-priority TODOs for this project is to use youtube-dl to nicely archive YouTube & SoundCloud links.
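
A hedged sketch of what that step might look like (the flags are real youtube-dl options, but the output layout is illustrative, not settled project behavior):

    # Illustrative youtube-dl invocation for archiving a media link.
    import subprocess

    subprocess.run([
        'youtube-dl',
        '--write-info-json',                # keep metadata for indexing later
        '--write-thumbnail',                # keep the thumbnail alongside the media
        '-o', 'archive/%(title)s.%(ext)s',  # hypothetical output layout
        'https://www.youtube.com/watch?v=EXAMPLE',  # placeholder URL
    ], check=True)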


But what do you do when a website has a broken media query, essentially destroying the print layout? Then a PDF is useless.

Well, I took a screenshot, better than nothing.


What version of Google Chrome do you need for the PDF export to work? I tried it on 58.0.3029.96 (Linux) and this does nothing (no error messages, it just quits without writing any files):

$ google-chrome --headless --disable-gpu --print-to-pdf 'http://example.com'

Edit: I'm completely baffled that such widely used software as Google Chrome can have this written in the man page: "Google Chrome has hundreds of undocumented command-line flags that are added and removed at the whim of the developers."


59 or later; --headless is a brand-new feature. On Linux there's no Canary channel, so install the dev build instead: apt-get install google-chrome-unstable.

https://developers.google.com/web/updates/2017/04/headless-c...
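
On 59+ the invocation from the grandparent comment works as-is; note that by default the PDF is written to output.pdf in the current working directory. A minimal wrapper sketch (the binary may be google-chrome-unstable depending on which channel you installed):

    # Print a page to PDF with headless Chrome 59+; output.pdf lands in the cwd.
    import subprocess

    subprocess.run([
        'google-chrome', '--headless', '--disable-gpu',
        '--print-to-pdf', 'http://example.com',
    ], check=True)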


This is the only place I've found them parsed and documented: http://peter.sh/experiments/chromium-command-line-switches/


Highly recommend switching to wpull (https://github.com/chfoo/wpull), which was built as a wget replacement. It's what grab-site uses, which is a successor to ArchiveTeam's ArchiveBot.

"grab-site is made possible only because of wpull, written by Christopher Foo who spent a year making something much better than wget. ArchiveTeam's most pressing issue with wget at the time was that it kept the entire URL queue in memory instead of on disk. wpull has many other advantages over wget, including better link extraction and Python hooks."


This looks awesome, thanks for the suggestion! It'll help with WARC support as well; it looks like it can output WARCs with just a CLI flag.
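
A hedged sketch of that kind of invocation (wpull takes wget-style options; these flag names are assumptions, so check wpull --help before relying on them):

    # Illustrative wpull crawl with WARC output; not tested against this project.
    import subprocess

    subprocess.run([
        'wpull', 'https://example.com',
        '--recursive',             # follow links
        '--page-requisites',       # grab the images, CSS, and JS pages need
        '--warc-file', 'example',  # write example.warc.gz alongside the crawl
    ], check=True)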


Use Zotero and you have your own personal Pocket with snapshots. In addition, you can add tags, organize stuff into folders, etc. https://www.zotero.org/


Zotero is awesome! It doesn't provide a publishable stream of recently added articles though afaik.


You can, if you faff a little bit [0]. Unless you mean something else by publishable?

[0] https://www.zotero.org/support/groups#public_open_membership


I've been running wallabag [1] as my own Pocket instance. It's been running perfectly for a couple of years.

Also has a pocket import feature.

[1] https://wallabag.org/en


The demo does not have images. Maybe try

wget -nc -np -E -H -k -K -p -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' -e robots=off
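
For reference, the same flags written out with comments, the way a script like archive.py might shell out to wget (the URL is a placeholder; the user-agent string is the one from the suggestion above):

    # Annotated version of the wget invocation suggested above.
    import subprocess

    WGET_ARGS = [
        'wget',
        '-nc',               # no-clobber: don't re-download existing files
        '-np',               # never ascend to the parent directory
        '-E',                # add .html extensions where appropriate
        '-H',                # span hosts, so CDN-hosted assets are fetched too
        '-k',                # convert links for offline viewing
        '-K',                # keep backups of the original, unconverted files
        '-p',                # download page requisites: images, CSS, JS
        '-U', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4',
        '-e', 'robots=off',  # ignore robots.txt for personal archiving
    ]

    subprocess.run(WGET_ARGS + ['https://example.com'], check=True)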


I opted not to download images using wget. I figured if I needed in-article images the PDF+screenshot would be enough.


If it could connect to Pinboard.in as an alternative to Pocket, I would be screaming with joy. :-)


Pinboard offers an archive feature:

https://pinboard.in/upgrade/


If you can get me a sample pinboard export to look at, I'll whip up a regex that makes it work.


Oh, thanks!

There are three backup formats: XML (same as the Delicious v1 API), HTML (the legacy Netscape format that almost everyone can read), and JSON.

I can get back with an XML file later. There's also an API that might be of interest.

Here's someone who blogged about that:

http://behindcompanies.com/2011/12/a-guide-to-backing-up-pin...

The question was asked whether you can plug your API endpoint URL (https://[pinboardusername]:[pinboardpassword]@api.pinboard.i...) straight into IFTTT. Maciej confirmed that it should work; the problem is that you're essentially storing your login credentials in a third-party service, and you don't know whether they're storing and transmitting them securely.



Shouldn't be too terribly difficult to modify, since all Pocket does is provide a list of URLs, same as Pinboard.


There is a PR with pinboard JSON support now.


This is cool.

Could you add an option to either add tagging, or separate the tagged items into folders?

e.g. "programming/", "docker/" etc, I often find myself digging through my Pocket archive trying to find that one article I found 6 months ago and it gets incredibly annoying


I like having the sites by timestamp because they're guaranteed to be unique, and it makes traversing them easy. I'd be happy to add a tag column to the index though, which you could use with Ctrl+F to find articles. https://github.com/pirate/pocket-archive-stream/issues/1


Now if only Chromium could learn to write WARC archives, then it would be on par! :)

Great project!


I've been thinking along those very same lines for a long time (this project makes me wish for more free time).

I have half a mind to fork this and add something like https://github.com/internetarchive/warcprox, or at the very least walk through the generated HTML and brute-force inline all assets as a first pass :)


I've been thinking I'd love to have a WARC archive of all my browsing. So many times sites I remember seeing have gone offline, and didn't get archived by the big services. Ideally this has to happen with browser cooperation, so it can save resources from complex dynamic pages, including responses to user action.

This must happen either in the browser or in a proxy like the linked warcprox, in order to catch everything. But the proxy solution is getting less practical every day with key pinning and HSTS.

Maybe a future firefox will have an option to export everything to WARC?


I would be very on board with adding a WARC export option. I also hate how Chrome tosses all history older than 3 months. Running an archiving proxy hooked up to archive.py would kill both birds with one stone.


Can one automate extensions through headless Chrome? Then you might be able to trigger WARCreate instead. (It would be more efficient to run the Pocket export URLs through WAIL though; that should give you the WARCs you want.)


Yeah, you can use the remote debugger protocol to make JS calls in the context of the page. https://chromedevtools.github.io/devtools-protocol/

Not sure if it's worth including in my script though, since WARCs aren't easily browseable (correct me if I'm wrong).


> Not sure if it's worth including in my script though, since WARCs aren't easily browseable (correct me if I'm wrong).

I've had good luck with WebArchivePlayer: https://github.com/ikreymer/webarchiveplayer


I didn't try myself, but I did a quick search and found this: http://www.archiveteam.org/index.php?title=The_WARC_Ecosyste...


Well, no, but on a Mac .webarchives are Spotlight indexable and make for a nice single-file archiving approach, and I might actually have some old code that tries to convert between the two...


Oh sweet, that's pretty nice.


Also, I've been hacking away at http://github.com/rcarmo/newsfeed-corpus for a bit. Hadn't thought of doing archival on everything, but having this tied in seems like a logical step ;)


Oh, and kudos for the awesome GitHub username :)


Or EML/MHT. It's the format email programs use to store HTML mail, including all pictures, JS, CSS, etc., in one plain-text file. IE 9-11 also supports that format (File -> Save as...) but calls it MHT.


I wrote a Chrome extension that similarly saves copies of pages you bookmark: https://chrome.google.com/webstore/detail/backmark-back-up-t...


You can see something is flawed in Redux the moment you have to pass strings (uppercase constants defined somewhere) around, import them in every file, and use them as identifiers for what to do with each piece of data.

Strings!


Did you comment on the wrong article by accident? https://news.ycombinator.com/item?id=14273549


Yes!

Thank you.


This seems neat. Curious, what are the use cases for this?


Slowing down the inevitable tide of https://en.wikipedia.org/wiki/Link_rot. When I cite blog posts or want to share sites that have gone down, I can swap out the links for my archived versions.


Why is archive.org or one of the other centralized web archives not suitable for that? They don't index the content you want to retain?


Never rely on others to care about things you need.


Archive.org is a single point of failure; also, they don't take PDFs and screenshots of fully rendered sites with their JS-loaded content.


Would be cool to see this for Instapaper or Pinboard


My script should work with very minimal tweaking if you can get a list of urls + titles from those services.

Just one line of regex changes probably: https://github.com/pirate/pocket-archive-stream/blob/master/...


Just about what I was hoping to see in the comments: the expected format of the input.

May I suggest including a sample line of the Pocket export format / your input format in the readme.md?

Thanks for publishing!


I included a sample of the expected pocket list format in the repo: https://github.com/pirate/pocket-archive-stream/blob/master/...

And a comment next to the regex for parsing it: https://github.com/pirate/pocket-archive-stream/blob/master/...
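
For anyone curious, a simplified stand-in (not the repo's actual pattern) for pulling the URL, timestamp, and title out of a Pocket HTML export, which wraps each saved link in an <a> tag carrying time_added and tags attributes:

    # Illustrative parser for a Pocket export; the real regex lives in the repo.
    import re

    LINK_RE = re.compile(
        r'<a href="(?P<url>[^"]+)"[^>]*time_added="(?P<time>\d+)"[^>]*>(?P<title>[^<]*)</a>',
        re.IGNORECASE,
    )

    with open('pocket_export.html') as f:    # path to your export file
        for match in LINK_RE.finditer(f.read()):
            print(match.group('time'), match.group('url'), match.group('title'))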


This is great. All it needs is a Docker container and I'd be running it now (I need to set some time aside this weekend to do that).


This is really cool! I've always had in mind a project where you save every page you visit and somehow expose them in the future, so you can see what you visited and maybe be reminded of important stuff based on some heuristic.


yes! thank you so much. Needed this badly.


You're welcome! I've been wanting to build this for ages but headless chrome finally inspired me to actually finish it.



