Hi HN, a few months ago I started building a suite of knowledge management tools for my own needs. It's been a long iterative process of noticing patterns (and inefficiencies) in my workflow, building simple tools to improve it, and evolving the UX over time.
One of the tools I have been using daily is a web clipper that not only captures the current page, but can also automatically extract key information from it. You can also do a quick lookup of your existing notes regardless of which web page you are on.
Prior to this, I had been using web clipper extensions by Evernote, OneNote, and Notion, and all of them had something missing that would significantly slow me down. Wanted to share what I have built to address this. The code is integrated with the [Rumin](https://getrumin.com) backend (the other tools I built), but you can easily swap out the API calls to point to local storage or some other endpoint.
Check it out. Would love to hear feedback from the community :)
Great project! Rumin looks very interesting as well. I was a long-time Evernote Web Clipper user, but switched to Notion a few months ago. I'm much happier with Notion's web clipping workflow and table storage approach, but it's not perfect.
thanks neovive! Yeah, Notion's web clipping and table storage approach is quite elegant. It only gets a bit clumsy for the less common "power user" use cases.
Wouldn't it be possible to support any video by simply scanning for <video> tags and getting the current playback information from there? I'm not sure, but is the extension able to control the video playback in order to navigate to the right time?
great idea! yeah that would seem to be a better design.
Regarding the playback time, currently I'm adding the "t=[TIME]s" parameter to the captured url, so it works for YouTube. But there are definitely more elegant solutions as this scales to support more websites.
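To make the `<video>` idea concrete, something like this could work (just a quick sketch, not the extension's current code; the function name is mine):

```ts
// Sketch: read the playback position from the page's first <video>
// element and append it to the captured URL as a "t" parameter.
// The "123s" format matches YouTube; other sites may expect a
// different parameter, which a per-site recipe could handle.
function withTimestamp(pageUrl: string): string {
  const video = document.querySelector("video");
  if (!video) return pageUrl; // no video on the page, capture as-is

  const url = new URL(pageUrl);
  url.searchParams.set("t", `${Math.floor(video.currentTime)}s`);
  return url.toString();
}
```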
Good work!
For extracting meta information -- a set of community-maintained information scrapers (HTML, or intercepting AJAX) for different websites could be cool. It's hard to maintain all the sites on your own (especially the ones you don't use), and by sharing we could avoid redoing the same work.
thanks! Yeah, that's a very good point, and one of the main reasons why I'm open sourcing this.
Community-maintained information scrapers/extractors are definitely a direction I want to build towards, collaborating with any existing efforts. The exact form will take some iterations, though (e.g. a marketplace for scripts/"recipes", built-in scripts for common sites, allowing individual users to save their own scrapers, etc.)
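To sketch what a shared "recipe" registry might look like (all names and selectors here are hypothetical, and selectors will rot over time, which is exactly why shared maintenance helps):

```ts
// Sketch of a per-site recipe registry: one extractor per hostname,
// each pulling site-specific metadata out of the DOM.
type Extractor = (doc: Document) => Record<string, string>;

const recipes: Record<string, Extractor> = {
  // Selectors below are illustrative placeholders, not tested ones.
  "www.youtube.com": (doc) => ({
    channel: doc.querySelector("#channel-name a")?.textContent?.trim() ?? "",
  }),
  "github.com": (doc) => ({
    stars: doc.querySelector("#repo-stars-counter-star")?.textContent ?? "",
  }),
};

function extractMetadata(doc: Document): Record<string, string> {
  const recipe = recipes[location.hostname];
  return recipe ? recipe(doc) : {}; // fall back to generic capture
}
```

A marketplace or built-in set would then mostly be a way to publish and pull entries into a table like this.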
Years ago, I spent a couple of months building a simple Evernote clone in Clojure. The weakest part of my “for my own use only” project was a Firefox extension I wrote to capture selected web page data and send it to the backend of my system.
This Web Clipper project would have really helped me. I hope the author of this gets the satisfaction of wide adoption in many cool projects.
This looks great! I use Evernote Web Clipper but spend a lot of time adding context/information/screenshots manually; this would save me a ton of time. I requested access to Rumin and will definitely try swapping this into my workflow.
Looks pretty slick, though custom metadata for just 7 sites seems pretty low for launch. Perhaps the default metadata capture is good enough for sites like Wikipedia, Amazon, etc. that aren't covered?
thanks for checking it out! Yeah, the coverage for the metadata capture definitely needs to be improved. At this point, it just includes the top sites for my own use cases (and some early users).
I was hoping that by sharing it I could get a better sense of what sites other people would like to have supported, and keep adding to it :)
haha, for now... the main reason being it's just me working on it at the moment, and I'm fixing/cleaning things up before releasing more of the code base. The rest of the product is pretty clunky (with a beyond-shitty code base).
In the meantime, it should be easy to swap out the API hostname for something else (or even local storage).
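Roughly, the swap could look like this (the endpoint path and `Capture` shape are made up for illustration, not the real API):

```ts
// Sketch: persist a capture locally instead of POSTing to the backend.
interface Capture {
  url: string;
  title: string;
  metadata: Record<string, string>;
}

async function saveCapture(capture: Capture): Promise<void> {
  // Hosted version (endpoint path is hypothetical):
  // await fetch("https://getrumin.com/api/captures", {
  //   method: "POST",
  //   headers: { "Content-Type": "application/json" },
  //   body: JSON.stringify(capture),
  // });

  // Local alternative: key captures by URL in extension storage
  // (chrome.storage.local.set returns a promise in Manifest V3).
  await chrome.storage.local.set({ [capture.url]: capture });
}
```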
Please pick a license. By not doing so, you retain full ownership of the code, preventing other people from modifying it for their own needs. See here for more details: https://choosealicense.com/no-permission/
Thanks Lucas! I was a Notion web clipper user as well. It worked for the most basic use case of saving a page into a table, but these use cases kept coming up for me:
- An idea belongs to multiple collections, as opposed to a single "table"
- There are usually properties/metadata I want to save (e.g. YouTube channel information), which would take multiple copy-and-pastes back and forth each time
- Bi-directional linking of captured content
- I wanted full control over the captured data, for more advanced queries/filtering
And it's sad that web clippers tend to be one of those "table stakes" features that companies build a basic version of and then don't invest in further.
Quick answer for "Why should I use Rumin?" is: "Perhaps you shouldn't yet, but let's stay in touch and I'd love to hear about your use cases and other ideas."
The current version of Rumin is very rough, and there's an overwhelming list of improvements to make. This is one reason why I've closed it to sign-ups for now. But in the meantime, I feel there's a lot the community can do even with just the web capture component being open source.
Regarding your concern about data loss: I intend to open source more and more parts of the platform, and figure out a model to make the development sustainable.
Ah, web clipping. I haven't heard that phrase used since PDAs were running on the Mobitex network and had to use web clipping to usefully browse the internet at all.