One of the first projects we're building on top of Fathom is a collection of 'rules' for extracting a consistent metadata representation of web pages. For instance, many pages expose Open Graph tags to identify semantic metadata such as title, description, icon, preview image, and canonical URL. However, not all pages use Open Graph: some expose Facebook tags, Twitter tags, or generic HTML meta tags, and some don't expose even those!

We want to use Fathom as the engine for applying a series of rules to look for various forms of metadata in pages and collect them in a consistent fashion. This can be used for storing/querying rich data about pages, presenting nice previews of pages to users, and other applications.
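To make the idea of "a series of rules" a bit more concrete, here is a minimal sketch in TypeScript. It is not Fathom's API and not the actual code in the repository: the `MetaTag` shape, the `RULES` table, and `extractMetadata` are all hypothetical names invented for illustration, and the page is modelled as a plain list of already-parsed meta tags rather than a real DOM. It also uses a simple "first matching rule wins" priority order, which is a simplification I'm assuming here rather than how Fathom itself ranks candidates.

```typescript
// Hypothetical, simplified model of a page's <meta> tags (not the library's real types).
interface MetaTag {
  key: string;     // value of the tag's `property` or `name` attribute
  content: string; // value of the tag's `content` attribute
}

type Field = "title" | "description" | "image";

// For each output field, the keys to try in priority order:
// Open Graph first, then Twitter, then a generic HTML meta tag where one exists.
const RULES: Record<Field, string[]> = {
  title: ["og:title", "twitter:title"],
  description: ["og:description", "twitter:description", "description"],
  image: ["og:image", "twitter:image"],
};

// Apply the rules to a page's tags and collect the results in one consistent shape.
function extractMetadata(tags: MetaTag[]): Partial<Record<Field, string>> {
  const result: Partial<Record<Field, string>> = {};
  for (const field of Object.keys(RULES) as Field[]) {
    for (const key of RULES[field]) {
      // Ignore tags that are present but have empty content.
      const match = tags.find((t) => t.key === key && t.content.trim() !== "");
      if (match) {
        result[field] = match.content.trim();
        break; // the first rule that matches wins
      }
    }
  }
  return result;
}

// A page with no Open Graph tags still yields a title and a description.
const example = extractMetadata([
  { key: "twitter:title", content: "Hello" },
  { key: "description", content: "A generic description" },
]);
```

The point of the sketch is only the shape of the approach: each field has several possible sources, and the caller gets back the same structure no matter which source a given page happened to provide. Fields with no matching source are simply left out of the result.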

My hope is that this repository acts as a collection point for people to continue to contribute rules for various forms of metadata, from basic page representation to more domain-specific things like product data, media data, location data, etc.

The project can be found here:

https://github.com/mozilla/page-metadata-parser

This library is designed to be used either as a node package within a server-side node ecosystem, or client-side through npm and webpack/browserify/etc. It's currently tested against node and Firefox.

Soon I hope to wrap it in a RESTful API with a Docker container, which will be found here (still needs docs/tests):

https://github.com/mozilla/page-metadata-service

I recently found a very similar project called `Manifestation` which does almost the exact same thing, so I hope to collaborate with Patrick and integrate the projects if possible.

https://github.com/patrickkettner/manifestation