Show HN: Self-hosted offline Internet from your browsing history (github.com/c9fe)
537 points by graderjs on Nov 11, 2020 | 158 comments



Isn't this how the internet was supposed to work in the first place? I remember Netscape navigator having a 'go offline' icon in the corner.


I remember that button, too, but I think it had more to do with connection charges than caching.

In Netscape days, many people would have to pay by the minute to be connected to the internet. In those days, web pages generally contained far more information than they do now, and were less interactive. So you'd connect, load the content you wanted to see, disconnect, and then just sit there and read it for free, instead of bleeding cash.


Back in the nineties the web was fairly new and people still used a thing called Usenet quite a bit. You interacted with it kind of like email (Google Groups is actually the final vestige of it) - and the go offline button would just download all your emails and newsgroups and you got to peruse them offline at your pleasure. It might seem strange also that back in those days you downloaded your emails from a server using POP3 rather than looking at them remotely (e.g. Web or IMAP), and you viewed them offline.


IMAP can still be used to synchronize an offline copy, and it makes sense to do so in case your provider fucks you over.


News servers don't use POP3. They serve articles over NNTP.


> back in those days you downloaded your emails from a server using POP3

I would have said UUCP for news but that's showing my age.


I'm not sure that was the use-case. With pre-DHTML HTML4, there really just wasn't anything on a page that could continue to interact with the server after the page finished loading. So, presuming the button was for your described use-case, what would the difference be between "going offline" and just... not clicking any more links? (It's not like Netscape could or should signal your modem to hang up — Netscape doesn't know what else in your OS might also be using the modem.)


You must be young :)

These browsers were born in the era of dialup Internet that had per minute charges and/or long distance charges. At the very least you were tying up your family's phone line.

Basically it's like paying for every minute your cable modem is plugged in.

For the feature itself: Netscape had integration with the modem connectivity for the OS and would initiate a connection when you tried to visit a remote page. Offline mode let you disable automatic dialing of the modem.


I ran a BBS, my friend :) I'm quite familiar with modems. I just never used Windows (or the web!) until well past the Netscape era, so I'm not too familiar with the intersection of modems and early web browsers.

> Netscape had integration with the modem connectivity for the OS and would initiate a connection when you tried to visit a remote page.

That's not "integration with modem connectivity", that's just going through the OS's socket API (or userland socket stack, e.g. Trumpet Winsock); where the socket library dials the modem to serve the first bind(2). Sort of like auto-mounting a network share to serve a VFS open(2).

Try it yourself: boot up a Windows 95 OSR2 machine with a (configured) modem, and try e.g. loading your Outlook Express email. The modem will dial. It's a feature of the socket stack.

These socket stacks would also automatically hang up the modem if the stack was idle (= no open sockets) for long enough.

My point was that a quiescent HTML4 browser has no open sockets, whether or not it's intentionally "offline." If you do as you say — load up a bunch of pages, and then sit there reading them — your modem will hang up, whether or not you play with Netscape's toggles.

(On single-tasking OSes like DOS — where a TCP/IP socket stack would be a part of a program, rather than a part of the OS — there was software that would eagerly hang up the modem whenever its internal socket refcount dropped to zero. But this isn't really a useful strategy for a multitasking OS, since a lot of things — e.g. AOL's chatroom software presaging AIM — would love to poll just often enough to cause the line that had just disconnected to reconnect. Since calls were charged per-minute rather than per-second, these reconnects had overhead costs!)

> [Netscape's] offline mode let you disable automatic dialing of the modem.

When you do... what?

When you first open the browser, to avoid loading your home page? (I guess that's sensible, especially if you're using Netscape in its capacity as an email client to read your already-synced email; or using it to author and test HTML; or using it to read local HTML documentation. And yet, not too sensible, since you need to open the browser to get to that toggle... is this a thing you had to think about in advance, like turning off your AC before shutting off your car?)

But I think you're implying that it's for when you try to navigate to a URL in the address bar, or click a link.

In which case, would the page, in fact, be served from the client-side cache, or would you just get nothing? (Was HTTP client-side caching even a thing in the early 90s? Did disks have the room to hold client-side caches? Did web servers by-and-large bother to send HTTP/1.0 Expires and Last-Modified headers? Etc.)


I used to go into offline mode so the browser would serve pages from the cache when I went to their URLs. It wasn't a ton, but it was enough that you could queue up a handful of sites, go offline and then, if you accidentally closed the tab, re-open it and see the cached version.


Haha, “tab” :)


I think the version of Netscape I'm thinking of (6/7) had tabs, but it's been a while.


- What does “offline mode” do?

The browser shows an "I'm offline. Do you want to proceed?" dialog. If you click OK, it tries to open a socket; on that request the OS dials the modem and brings up PPP, which takes ~30 seconds, and the PPP session disconnects again after a few minutes. Otherwise it aborts. If the browser is already online, the dialog is skipped.

- Was caching a thing? Did disks have room to hold cache?

Oh it was, and there was room. At an ideal 56 kbps (7 kB/s), pages ranged from a few kilobytes to a single megabyte at most. An ADSL connection at ~1 Mbps (1/8 MB/s) meant a 256 kB page loaded in about a second on a good day. Even 200 MB of disk space holds 1024 such ~200 kB pages, and IE back then kept a hidden but otherwise ordinary folder deep under C:\ holding roughly that much of a random cache of files it decided to keep.

Fuck I’m old now


> In which case, would the page, in fact, be served from the client-side cache, or would you just get nothing?

IIRC, some of the browsers, in "offline mode", would actually serve pages from the local cache, not even attempting to fetch them from the remote server. If you attempted to navigate to a page that hadn't been cached, you got an error of some kind.


I can confirm this behavior, I was there.


> Was HTTP client-side caching even a thing in the early 90s? Did disks have the room to hold client-side caches?

I was considered mad by friends at the time for upping my cache to a whopping 2 MB, but Netscape's cache was highly configurable.

Things like cache pages, but not images, always cache bookmarked pages, cache iframe pages (which were often navigation), etc. Netscape 4 added CSS to that mix.


Server-Sent Events? How old is that mechanism?


2006 was when Opera added the functionality which would become Server-Sent Events. [0]

[0] https://dev.opera.com/blog/event-streaming-to-web-browsers/


Or even dialup. Expecting a phone call? Go offline.


Kind of. HTTP was designed with caching in mind, so the idea was that if you GET a page it should more or less not change and you could add headers and stuff to instruct proxy servers about whether to cache or not and for how long. I think you could use HEAD then to check if a page had changed ...
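
For anyone who never saw it in action, that revalidation dance still exists; a rough Node sketch of a conditional GET (the hostname and date are just placeholders):

  // Conditional GET: send If-Modified-Since and the server may answer
  // 304 Not Modified with no body, so the cached copy can be reused.
  const https = require('https');

  const req = https.request({
    hostname: 'example.com',   // placeholder host
    path: '/',
    method: 'GET',
    headers: {
      // the Last-Modified value remembered from an earlier response
      'If-Modified-Since': 'Thu, 01 Oct 2020 00:00:00 GMT'
    }
  }, res => {
    if (res.statusCode === 304) {
      console.log('Not modified - serve from cache');
    } else {
      console.log('Changed - re-download; new Last-Modified:', res.headers['last-modified']);
    }
    res.resume(); // drain any body
  });

  req.end();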

The browser cache used to actually be quite dependable as an offline way to view pages, but this seems to have fallen out of favour in the mid-noughties. I remember how disgusted I was when I realised Safari was no longer letting me see a page unless it could contact the server and download the latest version.

I used to have a caching proxy server that would basically MITM my browsing and be more vigilant than even the cache, and it really worked quite well. This was back in the 90s when every bit of your max 56 kbps counted, or when you wanted to read something while your Dad or sister wanted to also use the phone.

Anyway, you can no longer take this approach because bad people broke the Internet and now you have to have a great honking opaque TLS layer between you and the caching servers so there's no way for this optimisation to work any more.

Of course it isn't really as important these days because we've got faster connections and interactions with the server are far less transactional and richer. But I still would like to have a way of tracking my own web usage and being able to go back in time without having to actually revisit each and every site.

These days you have to hack the browser because that's where your TLS endpoint emerges. Kaspersky tried this for their HTTP firewall application and there were ructions over that.

I'll defo take a look at this. Sounds just like what I've been looking for.

> Isn't this how the internet was supposed to work in the first place? I remember Netscape navigator having a 'go offline' icon in the corner.

Thinking back actually, if you forget about "the web"/HTTP - then yes actually - this is exactly how usenet worked and now I'm remembering that the "go offline" button used to download all your newsgroups along with your email and stuff so you could look at it all offline :-)

If you want something that's like Usenet these days check out Scuttlebutt.


> Kind of. HTTP was designed with caching in mind, so the idea was that if you GET a page it should more or less not change and you could add headers and stuff to instruct proxy servers about whether to cache or not and for how long. I think you could use HEAD then to check if a page had changed ...

Nothing in the HTTP spec has changed about this AFAIK. The internet still behaves this way.

> Anyway, you can no longer take this approach because bad people broke the Internet and now you have to have a great honking opaque TLS layer between you and the caching servers so there's no way for this optimisation to work any more.

You could certainly still do this, you just need to import the proxy's CA certificate into your browser. It's just not possible to do to others on your network without consent now. (This is a security feature, not a bug.)


I’m all ears ... got any pointers for me?

I was under the impression that TLS everywhere breaks the web as it was designed, ergo though nothing in the spec has changed - the environment itself has changed.

You could probably do something with homomorphic encryption but nobody seems that bothered. It's probably of limited value nowadays as we use the web more as an RPC layer.


It's what bigcorps do to monitor employee traffic, my work laptop has a certificate installed so they can snoop and proxy traffic.

Setup will vary by proxy and OS but here's one example (which I haven't used but at a glance seems OK):

https://www.ssltrust.com.au/help/setup-guides/setup-squid-pr...


Don't know if it has changed recently, but that's how the internet in Cuba worked (at least up until 2014)

https://www.google.com/amp/s/amp.theguardian.com/world/2014/...


Firefox still has it under File.


In the modern hamburger menu, it’s under “More”.


Even if you have the menu disabled, you can still open it by pressing Alt; no need to suffer the hamburger.


Sure, but does it do anything?


Yes, it disables access to remote resources:

https://support.mozilla.org/en-US/questions/1263417

So if you have a local web server (like on localhost), you'll still be able to access it, but you won't be able to access non-local IPs.


My Palemoon still has a “Work Offline” item in its File menu.


I am coding my own hacked together bookmarks manager. I can save any page with a click of the button using SingleFile (a fantastic Chrome extension, by the way!).

Then a cronjob runs and puts it into a folder to be processed into a database, which generates a static html index and puts it in my Google Drive.

Then it syncs offline on my chromebook. Which means that without internet, I can put my chromebook in tablet mode and do some nice reading. I've been very pleased so far.


I currently extract bookmarks from Firefox and Safari and store them inside a local database. Then a cronjob saves them to the Wayback Machine if a prior check reveals that they are not already archived (I donate regularly for that). Mine just makes sure that the pages are not lost, but yours enables offline reading.

I'm uncertain what the best mechanism is, there are so many ways to solve it. From filtering to recrawling for new content to enabling more advanced features, there are so many possibilities.


You should check out Shaarli [0], which has a thriving community.

[0] https://github.com/shaarli/Shaarli/


I had started working on something similar to this, but without the Google Drive component. I wanted something where I could right click and "snag" a file, link or document and have it saved to a server I controlled.

It's not complete, mostly because the frontend is a mess, but the backend is able to save files, pages and links (https://gitlab.com/thebird/snag). Used Rust backend, Svelte and JS for the extension (of course).


Any chance you have this open-sourced or described in more detail somewhere?


It's too hackish, system dependent, and not feature complete yet. I plan to run it as a flask app on the local network as a more intuitive way of tagging and managing bookmarks... lot more to do.


one day this will be as famous as youtube-dl



This is great, thank you!


Your hacked together workflow sounds extremely close to how ArchiveBox works! You should give it a try ;)


Been there done that. Real issue is ranking the search results by relevance to the search query.


Hello everyone! Thank you for the love, I really appreciate it. I did not expect this to blow up like it has: I got up this morning and it was sitting on the front page, and there's a lot more interest, which is great. I feel sad to see that some people are having issues with the binaries/connecting to Chrome; I will get to work on that today and hopefully have some resolutions pushed out in half a day, so please check back for updates. And I really appreciate all the feedback, it's very good to know how people are experiencing this. Thank you again for all your support.


Thank YOU! And I want to add that a Chrome extension is not a very internet-friendly choice. Firefox needs all the love it can get these days. :)


Alright, I've heard you and everyone else (many people) on this thread asking about a Ff version. I will make an issue now and seriously consider it. Investigations begin!

Also, an update on the binaries. I just pushed a new set of binaries, and tested that the Linux and Windows ones work. Hopefully that resolves the binary issues people were having, but I'm pretty scared some people will still have issues. If you do, please report them, ideally as an issue.


I wish Chrome Store Foxified kept working


> Firefox needs all the love it can get these days.

But can we count on Firefox to stay relevant even if Mozilla fired most of the dev team?


It's comments like this that lead to folks not building for Firefox, and therefore to no one wanting to use it because things aren't built for it.


I'm definitely going to try this out ASAP. I often use various hacks to save pages, because when I revisit a page 6-12 months later and need it again, the content has changed or is simply missing. If this can help me maintain my own mini "Internet Archive", I'll be very happy and will gladly do what I can to support your efforts.


This is a neat idea, but I wish it would respect some basic Unix standards by default. Two big annoyances jump out at first glance: it assumes you want to use port 22120, and it puts its config in ~/22120-arc. Maybe both of these are configurable, but the directory is a terrible default. Use XDG (~/.config/22120) or at least use a hidden directory in the home dir. And the port it operates on should be completely configurable. Naming the project 22120 is a terrible idea, and assuming that port won't need to change is bad practice.

I'm not making any value judgment about the actual tool. It sounds interesting enough. But it should behave better.


> Use XDG (~/.config/22120)

That's only for configuration files. The actual data should go in $XDG_DATA_HOME, by default ~/.local/share/22120/. Many languages have a tool (e.g. https://pypi.org/p/appdirs/ ) for finding the appropriate data/config/cache directory for each OS.

https://wiki.archlinux.org/index.php/XDG_Base_Directory
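
On Linux you don't even need a library; resolving the XDG variables with their spec'd fallbacks is a few lines of Node (the app name here is just this thread's example):

  // XDG Base Directory resolution with the spec's documented fallbacks.
  // macOS and Windows have their own conventions; a helper library covers those.
  const os = require('os');
  const path = require('path');

  const APP = '22120'; // example app name

  const dataDir   = path.join(process.env.XDG_DATA_HOME   || path.join(os.homedir(), '.local', 'share'), APP);
  const configDir = path.join(process.env.XDG_CONFIG_HOME || path.join(os.homedir(), '.config'), APP);
  const cacheDir  = path.join(process.env.XDG_CACHE_HOME  || path.join(os.homedir(), '.cache'), APP);

  console.log({ dataDir, configDir, cacheDir });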


Hey, I feel really sorry for you that you did not have a good or accurate experience of this. It must seem like it does not respect your wishes, or would not if you tried it more.

Okay so...

The config is actually located in a hidden file, here: https://github.com/c9fe/22120/blob/8b6cc758f14d34f564fd3a838...

    const pref_file = path.resolve(os.homedir(), '.22120.config.json');

And the port is fully configurable here https://github.com/c9fe/22120/blob/8b6cc758f14d34f564fd3a838...

  const server_port = process.env.PORT || process.argv[2] || 22120;

I don't agree with you about the name. And I get it if you're saying you feel the name should not be linked to a port number, but I also don't agree with that.

I guess I can explain my thinking, but I'm not sure it will help you understand or like it. Basically I like the name, and I did consider changing it a few times, but it just stuck for a number of reasons. I can't remember exactly now, but the port number might have come after the name. I think it's a good idea, and cool, if the server runs on the same port as the name, but at the same time I know you might have something else running on that port, so that's why I made it configurable. You might also want to run more than one copy... though I'm pretty sure we can't open two Chrome windows with different debugging ports using the same user data directory.

Anyway, that might be more info than you want or care for, and it might not help, but hopefully you have a clearer picture at least. I get it if you still don't like it, but I just feel sorry for you that you didn't have a good experience of this. I did not design it to annoy you.


I agree with what you say, while also pointing out that the Unix home directory has become a complete mess. Any installed software can do whatever it likes, there is no enforcement mechanism, and advice in the form of constructive critique or comment is not even a drop in the bucket towards fixing the problem.


I find that I want to use a project _more_ if they follow best practices like using ~/.config/{app_name}. Attention to details like that usually indicate a higher quality piece of software overall.


Very interesting project.

I wish this was actually (optional) built-in behavior for browsers when bookmarking pages, or at least when adding to a "read later" list like Pocket/Instapaper etc.

Pocket seems to offer something like this, but only in the premium version, so the "permanent archive" ironically seems to go away when unsubscribing.

As a workaround, what if bookmarking a (public) page could actually ping it to archive.org for archival?


I implemented some options for that purpose in SingleFile [1]. They allow you to save the page when you bookmark it and optionally replace the URL of the page with the file URI on your disk.

[1] https://github.com/gildas-lormeau/SingleFile


Why a zip file instead of a WARC file?

https://en.wikipedia.org/wiki/Web_ARChive


Because it's easier to produce and extract. The zip format also allows creating self-extracting files (I'm referring to SingleFileZ). I'm not sure this is possible with the WARC format.


I see you answered this in a thread a year ago [1] (came up in a Google search), my apologies.

[1] https://news.ycombinator.com/item?id=21426056


see: https://github.com/c9fe/22120#why-not-warc-or-another-format...

> Both WARC and MHTML require mutilatious modifications of the resources so that the resources can be "forced to fit" the format. At 22120, we believe this is not required


This is false. If you're doing WARC correctly, HTTP resources/responses are stored verbatim.

Perhaps the only possible purer format would be packet captures, say of the full HTTPS session, along with the session keymatter and connection metadata to later extract the verbatim HTTP resources. That'd be interesting, but I doubt that's what this "22120 format" (for which I see no documentation links) does.


Cool, I was so into maff back in the days. I'll give it a try.

(I even wrote this before checking out your link: Have you heard of https://en.wikipedia.org/wiki/Mozilla_Archive_Format from two or three Internets ago? If so, what are your thoughts on it?)


I would recommend taking a look at SingleFileZ [1], it should remind you of something ;)

[1] https://github.com/gildas-lormeau/SingleFileZ


selfplug: https://sbeckeriv.github.io/personal_search/

I am working on a personal project like this. It is in the early stages. I am creating a local search based on my browser history, so it doesn't crawl pages. Also, fetching happens outside the browser, so authed URLs are not supported out of the box.

I have a bookmarklet currently that lets me "pin" my page. My pinned pages are my new home page. It's how I keep my tabs closed.

I do not archive at the full-page level (but I could). Instead you get an offline view that is stripped of most things. Example: https://raw.githubusercontent.com/sbeckeriv/personal_search/...

Demo of the pin:

https://www.youtube.com/watch?v=5g_mXXFwQlg

A self-hosted version is on the roadmap.


> I wish this was actually (optional) built-in behavior for browsers when bookmarking pages, or at least when adding to a "read later" list like Pocket/Instapaper etc.

Sideshow Ask HN: Didn't Firefox mobile work like this? I could read the reader view items offline...

Anyone knows what's happening with the whole bookmarks/collections situation?


I can confirm this. Pages you bookmarked in reader view were saved offline.

I use Firefox Nightly on Android and the feature disappeared (I think) sometime this year when they re-did the whole UI. I can't find an article or bug ticket explaining the reason why.


I have a project https://www.github.com/dhamaniasad/crestify that does the archival to archive.org and archive.today, you might find it useful


There's also https://github.com/oduwsdl/archivenow which supports a bunch of different archiving platforms.


This is a feature of iOS Safari with the Reading List. It's not been particularly reliable for me though.


iOS's implementation is definitely useful, but I was thinking more along the lines of a permanent archive persistently stored.

iOS seems to optimize for temporary offline scenarios; saved pages do not seem to be backed up or synced to iCloud.


Yeah. I also assume it deletes the pages after they are "read" but who knows, there's no insight into the feature.

The best bookmarking option for archival seems to be the pinboard.in archive plan.


ArchiveBox also saves every page to archive.org by default.


Pinboard.in (not affiliated, just a happy customer) offers an archiving service for saved bookmarks.


What it doesn’t offer is an integration with the browser to make it seamless to work with those bookmarks. There are various extensions for Firefox which will save to Pinboard (one of them is mine!) but to work with them - you have the option of going to the website or using the mobile site in a sidebar (I do this, with some custom css to make it more readable for me).

There's a nice macOS application (sorry, can't remember the name right now) which gives you a better interface, but... they're bookmarks. I would like them to be integrated with regular browser bookmarks. And to be usable when the site is down or I'm offline. And to appear when I search... lots of possibilities there.


This uses Chrome DevTools Protocol in a pretty clever way. I used it to archive a highly interactive website and it worked like a charm.

The README states: "It runs connected to a browser, and so is able to access the full-scope of resources (with, currently, the exception of video, audio and websockets, for now)"

I wonder what kind of limitations makes it hard to intercept those resources like the rest of the content.
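
For the curious, the core of that interception is the DevTools Protocol's Network domain. A rough sketch with the chrome-remote-interface package against a Chrome started with --remote-debugging-port=9222 (not the project's actual code; bodies for streamed media often aren't retrievable this way, which is probably part of the video/audio difficulty):

  // Rough sketch: capture responses over the Chrome DevTools Protocol.
  // Requires `npm install chrome-remote-interface` and a Chrome launched
  // with --remote-debugging-port=9222.
  const CDP = require('chrome-remote-interface');

  (async () => {
    const client = await CDP({ port: 9222 });
    const { Network, Page } = client;
    await Network.enable();
    await Page.enable();

    Network.responseReceived(async ({ requestId, response }) => {
      try {
        const { body, base64Encoded } = await Network.getResponseBody({ requestId });
        // An archiver would persist url, status, headers and body here.
        console.log(response.status, response.url,
          base64Encoded ? '(base64 body)' : body.length + ' chars');
      } catch (e) {
        // Redirects and streamed media may have no retrievable body at this point.
      }
    });

    await Page.navigate({ url: 'https://example.com/' });
  })();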


Video and audio is probably just a matter of not having gotten around to it yet. WebSockets are another matter. I’m not sure what one would do with a two-way channel in a general sense. It’s often not an idempotent operation.


This guy is right: I just haven't gotten around to audio and video, it's a little bit more work I think. WebSockets aren't a normal request/response thing that's easy to cache, so it's a little bit more complex, and I haven't really begun to think about it in a way that I feel will be a good approach... or any approach that I think could work.


A video stream is rarely accessed in one request. There may be a preflight request to read the format structure, and of course any seek operation fragments it even more (range requests via the Accept-Ranges header). That makes it hard to assemble the resource for replay later.


Video files are massive, so it may just be the case that archiving videos takes so long they didn't want to support it.


That's also a consideration. I haven't thought of a way to approach it that keeps the nice archive sizes we have currently; adding video would just balloon them, and I think that could be surprising, so I haven't found a good approach to gel all that.


Not to mention things like e.g. livestreams ala Twitch, all it takes is one long stream at high resolution to take up your entire disk and then you can't archive anything else.


Given the hard 'no' in the FAQ, does anyone know about a similar project for Firefox?


Like the title of the GitHub repo suggests, ArchiveBox [0] can be used. You have to manually import your browsing history though...

In theory you could also use YaCy [1]... but that is intended as a search engine and not as an archive.

Edit: while looking into it I found alternatives [2] and Memex [3] seems to be interesting.

Edit2: I remember 2 Show HNs. One recorded your entire desktop and made it searchable. Can't remember what that was called, but I did find the AllSeingEye [4].

[0] https://archivebox.io/

[1] https://yacy.net/

[2] https://docs.archivebox.io/en/latest/Web-Archiving-Community...

[3] https://getmemex.com/

[4] https://news.ycombinator.com/item?id=7886270


Thanks for your answer, I had missed ArchiveBox completely!

I currently use Memex, but this is different approach, and I keep looking for a polished experience that can get more mainstream users into archiving/offline browsing.


Any chance you’re thinking of APSE? I submitted a Show HN a while back.

https://apse.io/


Add Webrecorder to the list


It's now called Conifer and listed under my [2] link :) but thank you for mentioning it by name so I could look it up again, seems interesting.

Looks like I've got some research to do this week


Look, now that Firefox seems to be supporting the WebDriver protocol more closely (I don't follow FF's progress that closely, so it might have caught up quite a lot), it will be possible; it'll just be a bit more work, and I don't care to do that without knowing how many people it would really affect. I feel quite sorry for the people who don't enjoy seeing they can't use Ff.

Also, I don't want to do it if I have to change the way I use the protocol, so I would want the same methods to be pretty much available and to work the same way.


Could the "HAR" file I can save from Firefox's Network tab somehow be used for this? That looks to be a recording of the entire timeline, including payloads.


I have used HAR files for archiving purposes in the past and it did work fairly well, but I'm not sure if there's a way of getting them programmatically


I built PageDash (https://pagedash.com), which performs more as a web clipper. Try something like this if you want something more lightweight and hassle-free. It does put the pages into the cloud. PageDash has been in operation since 2017 and is more or less self-sustaining (though not enough to make me quit my job).

Previous Show HN: https://news.ycombinator.com/item?id=15653206


Awesome! I always wanted this and at one point tried to achieve it with WWWOFFLE, but the welcome proliferation of https thwarted that attempt.

Gonna check it.

Unfortunately only for chrome. I am very much used to having my favourite set of Firefox plugins. Will have to check whether I can replicate that with Chrome.


Unfortunate state of affairs with Firefox extensions. Niche extensions do not exist anymore.

I had to switch to Chrome for extensions. Finding a Chrome extension that provides similar functionality to your Firefox ones should be easy.


>> Unfortunate state of affairs with Firefox extensions. Niche extensions do not exist anymore.

This seems like something that could be done in a proxy and be browser independent.


But then your proxy would need to do the TLS termination. Which is both kinda cumbersome to set up probably, and also it means you can no longer look at the certificates for your connections.


> Unfortunate state of affairs with Firefox extensions. Niche extensions do not exist anymore.

Interesting. I find Chrome extensions to be very limited in what they can achieve. Can not do without Tree Style Tab...


I've used http://www.gedanken.org.uk/software/wwwoffle/ a long time ago (when on modem).

What I'd like is to cache the history for each page too (important for news pages).



I’ve been storing my research as text files (manual copy and paste of web page content) for years.

And, I’ve wanted a _search history first_ plugin for web search to find pages I missed saving, but recall reading.

Since the former takes time and the latter doesn’t exist, I gather I could buy storage and save browsing using this tool.

It would be interesting to see how it works in practice—saving so much data.

Also, For work I’d be interested to know how it works for password protected sites like banking, social media, etc.


Just chiming in to say that Firefox location bar has some great filters [1] that might help you search history first (and other things). It doesn't do a full text search, but often helps me in a way I think you're after.

If you type: "^ worms" in the searchbar it will search your history for 'worms' and show the results in the dropdown. Typing "* worms" will search your bookmarks instead. The rest of the shortcut symbols are listed on the linked page. Hope that helps!

[1] http://kb.mozillazine.org/Location_Bar_search


It works fine for sessions. This is because it saves all HTTP responses (including headers). So in `serve` mode it replays everything as if the server was giving it to you. But in practice, cookies don't matter much (unless they're used by the client in JS), since we're not talking to a server anymore, just serving responses from disk. I'll add the sessions info to the README.
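
The replay half of that is conceptually tiny. A toy sketch of serving saved responses back from stored records (the record shape here is invented for illustration, not 22120's actual on-disk format):

  // Toy "serve" mode: replay previously saved responses keyed by path.
  const http = require('http');

  const archive = {
    '/': {
      status: 200,
      headers: { 'content-type': 'text/html; charset=utf-8' },
      body: '<h1>Saved copy of the page</h1>'
    }
  };

  http.createServer((req, res) => {
    const saved = archive[req.url];
    if (!saved) {
      res.writeHead(404);
      res.end('Not in archive');
      return;
    }
    res.writeHead(saved.status, saved.headers);
    res.end(saved.body);
  }).listen(8080, () => console.log('Replaying archive on http://localhost:8080'));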


I’ve been working on an app that’s pretty much exactly “the latter”! [0]

The amount of disk space it takes up isn’t crazy. It has been very useful for me.

[0] https://apse.io


I remember seeing this on reddit, the license changed quite a lot over the past month, with some very weird custom licenses asking you not to be a fake victim, lie, etc in the process - https://github.com/c9fe/22120/commits/master/LICENSE - how safe is it to assume that the current license will stay?


Assuming good faith in the creator of this, it looks like they tried to type up something that they thought would cover their bases, then realized that wasn't right and copy/pasted in a real license.


Yes, this person has it exactly right.

For everyone else, here's some definition of the original terms:

fake victim - when you do something and it goes south for you, but you try to mislay responsibility for your choices and try to blame someone else. So in this context it's to do with the standard disclaimer of liability (you won't hold us accountable, etc).

don't lie - basically people stealing the work and pretending it's their own. Like stealing it and relicensing it, etc. Or pretending they are the rightsholders. Basically the standard language about copyright, and sublicensing but in my own lingo.


Well, not quite.

It seems like the license went something like this:

MIT > Dual GPL/Custom license > No license > AGPL > No License/Commercial only > Custom license > Different custom license > License with a commercial pricing? > AGPL (3 days ago)

Most of this happened over the last few weeks, and the README says it's dual-licensed.


I miss RSS primarily because I was able to search for stuff I've either read or care about...

I am too embarrassed to admit that a disproportionate amount of my time is spent looking for a sentence or, god forbid, a tweet I vaguely remember reading last week.

So yes, consider this a vote for that sexy full text search please.


I've had exactly the same experience! Full text search on history would be amazing, especially when doing research.


All right thanks I will!


I just wanna cache all my bookmarks. I rarely look at them, but when I do go look at them, a good chunk tend to have rotted. It would be awesome to cache all my bookmarks and then have the option to recursively cache the path I'm in. I don't want to cache every page I visit; 90% is junk.


I'd recommend using Zotero as it's designed for researchers. It can save most pages for offline viewing, but also grabs metadata, and has specialized downloaders for papers and such.

Main downside is it's not trying to plug into existing bookmarks.

https://www.zotero.org/


I guess what you can do in this case, is:

1. Download the release (npm or binary).
2. Start it up.
3. Go to chrome://bookmarks.
4. Click on a folder and go "Open all".
5. Once they've loaded, click through each tab opened to make sure it loaded properly.
6. Check that they've been added to the index (go to http://localhost:22120).
7. Repeat 1-6 for all the folders of your bookmarks that you want to save.
8. Repeat 7 periodically.

I feel pretty scared for you that this will be too much work for you to feel good about doing it, but I want to say at least it will save it.

I think the use-case is good. I considered it in the past (automatically caching from bookmarks). I feel really bad for you this lets down your use case.


https://ArchiveBox.io is optimized for archiving your bookmarks. You can feed them in from a service like Pocket/Pinboard or use browser bookmarks.


>have option to recursively cache the path I'm in

It's interesting, what do you mean by that?


So if I'm viewing http://example.com/foo/bar.html I might want to cache not just /foo/bar.html but everything underneath the /foo/ path.


Awesome thing, but -

> Can I use this with a browser that's not Chrome-based? > No.

Note that a (rather similar) thing I've participated in in 2002 was browser-neutral.


yeah I was really excited until I got to that point.


I feel so bad for you guys that you can't use it on Ff. I've opened an issue now

https://github.com/c9fe/22120/issues/57

but it might not happen. I'm only considering it and I will investigate. I'm sorry for you that you can't use it now.


Search in the roadmap is exciting. It'd often be great to constrain search to just the pages I already visited, and there's never an option to do so. And because it's not a full index of the web with billions of pages, this search could be _a lot_ smarter than general purpose search, since it could actually understand both the pages and the query better in a completely privacy-protecting way using recent techniques that are cost prohibitive if you have to run at 1M QPS. Make it so!


It will be done! :p ;) xx


Anyone know of something like this that can sit on a network, possibly as a web proxy?


Squid (http://www.squid-cache.org/) is fairly close to what you're looking for.


Webrecorder.io is the one you want if you're looking for a proxy archiver. PYWB (the underlying library) is by far the best proxy archiver around imo.



Reminds me of the days of Webaroo [1] and Google Gears [2].

[1] https://en.m.wikipedia.org/wiki/Webaroo

[2] https://en.m.wikipedia.org/wiki/Gears_(software)


Ah, that brings back memories. Didn't Palm OS have something similar? I think it was Plucker [1], but I'm not too sure.

[1]


Yep. With Plucker, I could download the New York Times web site (I think via RSS) before I went to work, sync it to my Palm Pilot, and then read it on my lunch.


Many, many years ago, I used a custom build of Plucker to build an offline reader for the schedule of a large conference in the UK - to the point of rebuilding the schedule file daily to include updates, and using an old laptop with an IrDA port running pilot-link in a loop to sync the new schedule to anyone who wanted it... That was fun.


This looks pretty cool, I'm excited to try it. The choice of lesser known format vs WARC is interesting, afaik WARC can do pretty complex stuff (like replaying youtube videos).

See: https://github.com/Rhizome-Conifer/conifer


Coming from the author of ArchiveBox, that means a lot. It's great to see your comment here, thank you!


I'd love to see an entry in the FAQ explaining the weird name.


Just the port they used by default, to help remember it


It is but it also wasn't. The name might have come before the port. I might add that entry.


Cool concept. In a world that's getting increasingly connected what are the main use-cases?

I ask because the dev tool that our company creates occasionally (okay, very rarely) gets a question about offline mode, and when I prod, it's usually just out of curiosity, not because they actually need it in real life.


The coolest part I think is that you have a copy of all these websites on disk, which means you can run a full text search on all the websites you visited (or on their html, technically).

Browsers' history sucks. I don't know if this project does this, but I would absolutely love to be able to do SQL queries on my browsing history.

I have 'lost' many websites I remember visiting, but for which I didn't remember anything in the title.

Also, obviously, websites change sometimes, and the web archive might not have cached the website you visited. Although from what I can tell, this project doesn't version websites, it just caches the latest, so you would probably just overwrite the previous version accidentally.
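
Until then, part of this is already possible, because Chrome's own History file is plain SQLite; a rough sketch with better-sqlite3 (copy the file first since Chrome locks the live one, and the path below is just the common Linux location):

  // Rough sketch: SQL over a copy of Chrome's History database.
  // e.g. cp ~/.config/google-chrome/Default/History ./History-copy
  const Database = require('better-sqlite3');

  const db = new Database('./History-copy', { readonly: true });

  // Chrome stores last_visit_time as microseconds since 1601-01-01 (WebKit epoch).
  const rows = db.prepare(`
    SELECT url, title,
           datetime(last_visit_time / 1000000 - 11644473600, 'unixepoch') AS last_visit
    FROM urls
    WHERE title LIKE ? OR url LIKE ?
    ORDER BY last_visit_time DESC
    LIMIT 20
  `).all('%offline%', '%offline%');

  console.table(rows);

That only searches titles and URLs, of course; searching the page text itself is exactly where an archive like this comes in.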


> In a world that's getting increasingly connected what are the main use-cases?

Increasingly ≠ totally.

Even though I'm a developer, pre-pandemic I would have to spend a day or three offline several times a year while working. This would be useful for that.

I know an IT guy who works in mines. He loves anything that works offline.


This seems to be geared towards "content goes down" scenarios, rather than "reader is temporarily offline".

It's a concern I have every time I find a particularly interesting independently hosted blog post or article.

The Internet Archive goes a long way towards making me worry about this less, though. (Let's just hope they don't go away!)


It's both. Not great for suddenly going offline, as you need it on all the time and it is slow. But if you can plan it, like you're going into a mine, or a spaceship or airplane, or the bush, or an alien attack or whatever... and you need offline access to keep reading, then this works.

You can have archives from x day or x week or whatever, organized yourself, versioned with git, as you like, to save web content through version changes, removal or vanishing.


Would be cool if you could create a whitelist of websites, then have a feature to check if any other users have more recent versions of those sites (if they happen to be online). This way you get decentralized site updates without actually going to each site itself.


Yes. But also keep one's own originally downloaded version, in case the newer version is messed up

And even more cool: If one could browse one's friends' sites, while everyone was offline (if their privacy / sharing settings allowed), just a local net in maybe a rural village

Edit: roadmap: "Distributed p2p web browser on IPFS" -- is that it? :-)


Super cool! I built a similar project last year, but I haven't put much work into it because I built that version as an alternative browser and there's way too much overhead for too little gain. I like the idea of using DevTools to get the requests. If I revisit my project, I would build it to register itself as the system proxy, including SSL MITM.

For future work on this project, consider a search engine built on top of the downloaded files! Also, gzip your JSON.
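
The gzip part is essentially free in Node; a minimal sketch (the record shape is made up):

  // Gzip a JSON record on write, gunzip on read.
  const zlib = require('zlib');
  const fs = require('fs');

  const record = { url: 'https://example.com/', headers: {}, body: '<h1>hi</h1>' };

  fs.writeFileSync('record.json.gz', zlib.gzipSync(JSON.stringify(record)));

  const back = JSON.parse(zlib.gunzipSync(fs.readFileSync('record.json.gz')).toString());
  console.log(back.url);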

My older project for anyone interested: https://github.com/CGamesPlay/chronicler


Yes, I also want to add the search. I'm just not sure what the best full text search over filesystem files is in Node.
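
For what it's worth, a dependency-free starting point is a naive inverted index over the saved files; a toy sketch (no ranking, stemming or HTML stripping, and ./library is just a placeholder path):

  // Toy inverted index over text files in a directory: token -> set of file paths.
  const fs = require('fs');
  const path = require('path');

  function buildIndex(dir) {
    const index = new Map();
    for (const name of fs.readdirSync(dir)) {
      const file = path.join(dir, name);
      if (!fs.statSync(file).isFile()) continue;
      const text = fs.readFileSync(file, 'utf8').toLowerCase();
      for (const token of text.split(/\W+/)) {
        if (token.length < 3) continue;
        if (!index.has(token)) index.set(token, new Set());
        index.get(token).add(file);
      }
    }
    return index;
  }

  function search(index, query) {
    // Intersect the posting lists of every query term.
    const sets = query.toLowerCase().split(/\W+/).filter(Boolean)
      .map(t => index.get(t) || new Set());
    if (sets.length === 0) return [];
    return [...sets.reduce((a, b) => new Set([...a].filter(x => b.has(x))))];
  }

  const index = buildIndex('./library');
  console.log(search(index, 'offline archive'));

The hard part, as someone said above, is ranking the results.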


Very cool idea. I always bring my laptop with me camping in case I get the urge to write something.

Having the ability to see the last week or so of my browsing history would have come in handy on more than one occasion.


That's also what I think. For keeping on planes, and keeping the past version alive as a local copy. But honestly I didn't create it for that reason I think. I can't remember exactly now without using my power to think about it more...but basically I think I created it to just see if it was possible. I was doing lots of work in Chrome DevTools at the time and the idea suddenly came to me, and I wanted to see if it would actually work. This project has constantly surprised me by how much "legs" it has. People just love it. And the first version took me very little time to do. It's weird, other things that take a lot of time are not so popular. Utility or popularity with others has nothing to do with hard work, nor the amount of work, I think.


Ya, there could be so many use cases for this thing.

I thought of some evil ones, like I know some people at my company who would love to be able to browse an employee's past browsing history at their convenience.

It doesn't really need to be "offline" for that to work, but I can see that playing into calling it a security procedure rather than blatant overstepping.


I'd love to use this on my mobile, as that's where I mostly have problems with connecting to the internet, but it still looks pretty interesting.


It's a good point, but so far I don't think there's a way to open the Chrome debugging port on mobile and connect to it. But I think there probably should be some way; it's just not obvious to me right now.


This is really cool. Thank you for being open source.


"Distributed p2p web browser on IPFS" - killer feature on the roadmap! Please bring us back to the roots of the (d)Web.


Seems cool, but can someone explain to me the need for all of the obfuscated code in the files with "22120" in the name?


I think those are the build artifacts


That's true. If you ran the build without minification in webpack it would not look obfuscated.


Very interesting indeed. I remember having to paste a script in the console in order to be able to view my cached files.


Interesting stuff, I will look into this more.


This sounds like the perfect solution for Cuba's highly restricted access to internet. There's a whole network of people getting content from the US and distributing it on a weekly basis (mostly movies, TV shows and news)


This would be nice to have in combination with IPFS. https://ipfs.io/


It looks like something I'd appreciate! I make a significant effort to archive things that I think I'll need.

Unfortunately it didn't work when I just tried installing it now (macOS 10.13.6, node v14.8.0).

  MacBook-Pro:Desktop peter$ npx archivist1
  npx: installed 79 in 8.282s
  Preferences file does not exist. Creating one...
  Args usage: <server_port> <save|serve> <chrome_port> <library_path>
  Updating base path from undefined to /Users/peter...
  Archive directory (/Users/peter/22120-arc/public/library) does not exist, creating...
  Created.
  Cache file does not exist, creating...
  Created!
  Index file does not exist, creating...
  Created!
  Base path updated to: /Users/peter. Saving to preferences...
  Saved!
  Running in node...
  Importing dependencies...
  Attempting to shut running chrome...
  There was no running chrome.
  Removing 22120's existing temporary browser cache if it exists...
  Launching library server...
  Library server started.
  Waiting 1 second...
  {"server_up":{"upAt":"2020-11-11T21:48:25.324Z","port":22120}}
  Launching chrome...
  (node:33988) UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:9222
      at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1144:16)
  (Use `node --trace-warnings ...` to show where the warning was created)
  (node:33988) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
  (node:33988) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
  (node:33988) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
      at ae (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:321:14209)
      at Object.changeMode (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:321:8088)
      at /Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:321:16174
      at s.handle_request (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:128:783)
      at s (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:121:879)
      at p.dispatch (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:121:901)
      at s.handle_request (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:128:783)
      at /Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:114:2533
      at Function.v.process_params (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:114:3436)
      at b (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:114:2476)
  (node:33988) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 3)
  ^CCleanup called on reason: SIGINT
  MacBook-Pro:Desktop peter$


Damn, I got that issue before... just now reported on the repo. I feel really sorry for you and all the people who haven't had a good experience of using this. I will take a look.


This issue means that we couldn't connect to the Chrome debugging port. Maybe we couldn't open Chrome, maybe because we couldn't shut down the existing Chrome.

To possibly remedy it, and diagnose more, try:

  $ export DEBUG_22120=BLEHMEHEKTAA
  $ npx archivist1@latest



