Mozilla Fathom: Find meaning in the web (github.com/mozilla)
192 points by gfortaine on July 9, 2016 | 47 comments



Tim Berners-Lee loves to wax poetic on the evolution of our mental model of what's important and what we consider to be the nodes in the computing graph:

TCP/IP: "It's not the wires, it's the computers!"

HTTP/HTML: "It's not the computers, it's the documents!"

Semantic Web (/Fathom): "It's not the documents, it's the things that they are about!"

https://www.w3.org/DesignIssues/Abstractions.html

I look forward to the day when the idea of discrete, atomic documents is finally abstracted away in the same manner that we abstracted away the physical machines that host them.


I agree that our uses for the web will evolve again once we embrace the semantic web. To repurpose some marketing-speak, I see the semantic web as 'Web 3.0' (as it goes beyond the possibilities of the web apps of 'Web 2.0').

There does seem to be some overloading of the term 'semantic web', though. Sometimes the term is used to refer to semantic markup, such as using the <em> HTML tag instead of the <i> HTML tag...

https://en.m.wikipedia.org/wiki/Semantic_HTML

Other times, the term semantic web is used to refer to structured meaning transmitted through specialised metadata...

https://en.m.wikipedia.org/wiki/Semantic_Web

Fathom seems to sit across the two, though I'd suggest it's closer to the first, unless it's used as a tool to analyse the design of existing web pages for the benefit of new website designs (combined with site usage data, e.g. bounce rates, etc...).


Totally agree! The semantic web (Web 3.0) will change a lot about how we interact with websites, for example, and will finally make the information on the web usable. Currently, every website lives in its own little world. As long as we stay on one page it kind of works: I can sort and filter by properties, for example. But as soon as I want to combine information from different pages, it falls apart (for example, if I want to see the hotel I'm staying in alongside the best restaurants in the city from Yelp).

That is the reason we are currently working on http://link.fish, a smart bookmark manager that lets people work with the information behind the URLs. We are currently in a closed beta, but I'm happy to hear from anybody willing to give honest feedback.

Here is a short 2 min (low quality) demo video which shows how it works: https://youtu.be/Chfy3le5gY0


Link.fish looks promising; I can see it being a useful tool. Best of luck with it.


Thanks & great to hear!


Note that the Fathom approach rejects TBL's vision of standardized markup for semantics, and instead favours a rule-based approach similar to the many "content extraction" services/APIs/browser add-ons.

I'd note that Google's Knowledge Vault paper [1] found that "semantic web"-type technologies (very widely interpreted as anything that used metadata) were pretty useless as a source of data: only 0.25M out of 140M total "facts" extracted using those annotations were classified as high confidence (i.e. 0.2%).

This is in contrast to a DOM-rules-based approach (which seems similar to Fathom's), which extracted 94M out of 1200M total (i.e. 8%).

The Semantic Web is dead, and the KV paper buried it.

[1] https://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf


They are still the same kind of machine.


One of the first projects we're building on top of Fathom is a collection of 'rules' for extracting a consistent metadata representation of web pages. For instance, many pages expose Open Graph tags to identify semantic metadata such as title, description, icon, preview image, canonical URL, etc. However, not all pages use Open Graph: some expose Facebook tags, Twitter tags, or generic HTML meta tags, and some expose none of those!

We want to use Fathom as the engine for applying a series of rules to look for various forms of metadata in pages and collect them in a consistent fashion. This can be used for storing/querying rich data about pages, presenting nice previews of pages to users, and other applications.
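
As a rough illustration (a hand-written sketch, not the actual rules from the repo), a single 'title' rule boils down to a fallback chain over the places pages tend to put that information:

    // Sketch only: prefer Open Graph, then Twitter, then a generic
    // meta tag, and finally fall back to the <title> element.
    function extractTitle(doc) {
      const selectors = [
        'meta[property="og:title"]',
        'meta[name="twitter:title"]',
        'meta[name="title"]',
      ];
      for (const selector of selectors) {
        const el = doc.querySelector(selector);
        if (el && el.getAttribute('content')) {
          return el.getAttribute('content').trim();
        }
      }
      const titleEl = doc.querySelector('title');
      return titleEl ? titleEl.textContent.trim() : null;
    }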

My hope is that this repository acts as a collection point for people to continue to contribute rules for various forms of metadata, from basic page representation to more domain-specific things like product data, media data, location data, etc.

The project can be found here:

https://github.com/mozilla/page-metadata-parser

This library is designed to be used either as a Node package within a server-side Node ecosystem, or client-side through npm and webpack/browserify/etc. It's currently tested against Node and Firefox.
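
For a rough idea of usage (the getMetadata entry point below follows the repo's README; treat the exact name and signature as an assumption rather than a stable API):

    // Client-side sketch, e.g. bundled via webpack/browserify:
    // parse the current document and log whatever metadata was found.
    const { getMetadata } = require('page-metadata-parser');

    const metadata = getMetadata(document, document.location.href);
    console.log(metadata);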

Soon I hope to wrap it in a RESTful API with a Docker container, which will be found here (still needs docs/tests):

https://github.com/mozilla/page-metadata-service

I recently found a very similar project called `Manifestation` which does almost the exact same thing, so I hope to collaborate with Patrick and integrate the projects if possible.

https://github.com/patrickkettner/manifestation


Extracting machine-understandable meaning from web pages is much like extracting text from images.

Fortunately, we usually don't need to process web pages with fancy yet barely accurate algorithms in order to extract machine-readable text from them. Why? Because we agreed to use character codes to encode letters, and most of the time text is encoded with some character code, which makes it unnecessary to OCR pictures of hand-written letters in order to programmatically process text from web pages.

These kinds of programs wouldn't be needed if only the same thing had happened for page structures, that is, if HTTP had included page semantics.


The issue with creating tech for such semantics is whether authors will put in the effort to provide the metadata. For example, rel=next/previous has been around forever, but most web pages don't use it because browsers and other clients don't expose it. Other data mentioned in the examples, like title and Open Graph tags, is provided for search engines, Facebook previews, and the like.


But is Fathom supposed to be for annotating your own web site, or analyzing others'? If the former, then I'm truly bewildered, but it's not clear which it is.

As for rel=next/previous, I don't use them because Google makes it clear that they will make it treat the whole sequence like one paginated document in the index, contrary to their original semantics. I'd love it if someone could correct me on this.

[0] http://webmasters.stackexchange.com/questions/61573/is-it-ju...

[1] https://www.w3.org/TR/html401/struct/links.html#h-12.3

[2] https://webmasters.googleblog.com/2011/09/pagination-with-re...


I suspect it's against the website's interests. If you provide semantic markup, it makes it easier to crawl your website, extract the actual content, and leave the ads behind.


The <article> tag already makes it pretty easy.
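
For instance, pulling out just the main content of a page that uses it is a one-liner (a sketch that assumes the page actually wraps its content in <article>):

    // Grab the main content and skip everything outside <article>
    // (navigation, sidebars, ad containers, and so on).
    const article = document.querySelector('article');
    const content = article ? article.textContent.trim() : null;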


Well, yes, only then everyone realized that it takes another good 2-3x of the work, over and above writing the text, to put it into a form which is "machine-understandable" with a whole bunch of metadata, and it requires the people writing the text to be familiar with all of that and to have a good idea of what is "machine-understandable" and what is not, for..

zero gain. There is just no application. Google works perfectly fine. Give it up already.

And just to take this off on another tangent.. Word won the office space. There are no normal people writing LaTeX for their party invitations. And hell, even the people writing LaTeX don't want to bother with the "machine-understandable" thing, so they went ahead and made it into a proper Turing-complete programming language.


I think the solution is smarter document editors. Auto correct / suggest with context awareness, database integrations, machine learning / basic AI, and so on.

Even simple stuff would be helpful, like asking the writer to clarify which piece of text a written date applies to, or to define which parts of long sentences with many commas belong together (which clauses are interjections, for example?).


HTTP does/can include a lot in the way of semantics. It's entirely down to how the developer decides to write their markup.


You mean HTML, not HTTP, no?

Otherwise I'd be curious to hear how or why including web page layout semantics in HTTP would be useful.


I meant HTTP, but I have to admit it was not clear. What I had in mind when I wrote that was that there needs to be something that forces people to provide the semantics of web pages. By that I mean HTTP is too liberal, allowing any kind of document, including documents without semantic annotations, to be transferred. So I thought people would provide page semantics if HTTP required documents to be annotated with semantics, just as it requires a Content-Type to be given.


Ah, I see, that makes sense. The protocol level does seem like a good way to enforce it. Though I do have to wonder if, had that enforcement been put in place, people would have moved away from HTTP and toward a different, looser protocol on top of TCP. Or maybe that wouldn't have been practical. It's interesting to think how early, seemingly low level decisions about protocol design can have a profound effect on how things develop down the road.


This is an interesting library to watch for sure. Personally I have built many scrapers and extractors for in-house use, and I have spent many hours tweaking Readability JS, so I know how complicated and hard-to-test that code is. Seeing how Fathom does its job is cool -- it takes care of a lot of the low-level, bookkeeping parts so that all you need to do is focus on tweaking the ranking formula. I wouldn't be surprised if in the future we have a shared repo containing "recipes" to parse pages; slap a nice UI with DOM traversal on top and we'd have a Kimono-like app for parsing content.


Yes, you are exactly right; that is what we are planning with:

https://github.com/mozilla/page-metadata-parser

This repo is exactly what you describe: it's meant to be a collection of 'recipes' or 'rules' for extracting various forms of metadata from pages. It's still in its infancy, but we are close to deploying a first version of it to users via Test Pilot:

https://testpilot.firefox.com/

I would love feedback or contributions!


It looks rule-based. There are Python libraries which try to solve similar tasks using machine learning: https://pypi.python.org/pypi/autopager, http://formasaurus.readthedocs.io/en/latest/, https://github.com/scrapinghub/page_finder. I wonder how the quality compares. It is hard to make rules work reliably on thousands of unseen websites.


> HTMLElement.dataset is string-typed, so storing arbitrary intermediate data on nodes is clumsy

I use dataset extensively. Once each DOM element is assigned a unique key id, many things get simplified: DOM manipulation, client-server state sync, event handling, etc. Now, if an additional data layer can resolve a key id to semantic metadata, the only problem left would be to expose that data so third parties can read it. So the problem could be solved with three data points: the URL, the DOM key, and the resolved metadata value.
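
A minimal sketch of that pattern (the names are illustrative, not from any existing library): keep only a plain string key in dataset and resolve it to structured metadata through a side table:

    // Side table resolving a node's key id to structured metadata,
    // so nothing but a plain string ever lives in dataset itself.
    const metadataByKey = new Map();
    let nextKey = 0;

    function tagNode(node, metadata) {
      const key = String(nextKey++);
      node.dataset.key = key;             // stored as the data-key attribute (string-typed)
      metadataByKey.set(key, metadata);   // arbitrary structured data lives here
      return key;
    }

    function metadataFor(node) {
      return metadataByKey.get(node.dataset.key);
    }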

What's really needed isn't a new protocol or set of rules and conditions; OpenGraph probably works just fine. What's needed is a dedicated global database of metadata, a web white-pages directory. As OpenGraph is perhaps dominant, it may make sense for someone like Facebook to provide this. Historically, third-party non-profit services have not been popular.

All of this is of course contingent on the assumption that content hosts provide that metadata to begin with ;)


Happy to see another really interesting project from Mozilla.

Still worried about the extensions ecosystem though.


There's been a lot of misinformation about the extensions changes out there; what are you worried about? Most of the worries about the upcoming Firefox changes to addons seem to come out of this misinformation :)


> what are you worried about?

It is said to be similar to Chrome's. And Chrome doesn't support real extensions like tree style tabs, etc.

Happy if you can tell me I am misinformed.


Yes, that is wrong :)

So what's happening is that the extension stuff is going to be standardized, and that standard will be closer to Chrome's model.

However, this doesn't mean that functionality will be lost -- this is the key bit of confusion around this that's causing all the resentment.

Instead, the protocol will be extended. The intention is that almost all extensions, especially the popular ones like tree style tabs, will keep on working; however, they will need to be updated to work with the new APIs. Functionality from Firefox's API will only be removed when there is something on the WebExtensions side that gives the same abilities.

The current system lets plugin developers directly hook into Firefox source, which means that theoretically any change is a breaking change. Architectural overhauls like electrolysis were impeded by this too. Instead, the desire is to switch to a proper API that is separate from the inner functionality; which is basically a core principle of API design.

Chrome already does this -- the API is limited -- but the generic model is the right one. The plan is to do something similar in Firefox, with a much larger API (eventually) than what Chrome has.

At first, probably just the Chrome API will be implemented, but IIRC regular firefox addon APIs will not be removed until the corresponding APIs exist in the webextensions format.

This will mean that authors of old addons may need to put some work in to make them work again (which they had to do anyway for electrolysis, really). Also, obscure addons using very random internal APIs may break entirely. But for the most part, the average addon user shouldn't notice anything.


> Yes, that is wrong :)

Fantastic news, thanks!

> The intention is that almost all extensions, especially the popular ones like tree style tabs, will keep on working; however, they will need to be updated to work with the new APIs.

This sounds doable.

> but IIRC regular firefox addon APIs will not be removed until the corresponding APIs exist in the webextensions format.

Again, this should sort it.

My worry was that we would be left with only Chrome-like or worse browsers; for me, Firefox currently plays in a different league.

Edit: you possibly see why I and others were mistaken; you are the first one I know that has explained this. Feel free to tell relevant people why we misunderstood so we can clear up the FUD.


> Edit: you possibly see why I and others were mistaken; you are the first one I know that has explained this. Feel free to tell relevant people why we misunderstood so we can clear up the FUD.

Right. There's been a lot of FUD on this and not enough has been explicitly said. I'm pretty sure this was clearly stated in some places, but FUD spreads easily :)

Maybe I can convince someone to write something about it.


> Maybe I can convince someone to write something about it.

You should come chat in #webextensions; if we can identify where the messaging is falling down and fix it, that would be great!

I think one thing missing right now is having good "release criteria" for WebExtensions (like we did for e10s). I don't think implementing every single API that's available to old-style extensions right now is a tractable problem (since it's "every internal API", all the way up to executing arbitrary commands).


I had a chat in #devrel :)

Yes, of course, it's not every internal API, but the bottom line is that most extensions should still work, right? :)


Can I still modify the entire UI with an addon, for example, to add tab previews on hover?

Or are you guys going to implement every single incompatible addon yourself?


I don't know, but I think the answer is "it depends on how many people want that". It sounds like the kind of API that tree style tabs would need anyway.


Extensions are a form of vendor lock-in. Google was killing Mozilla in that space because:

- network effects (chrome is more popular than ff)

- chrome extensions are so much easier to write than XUL ff extensions

Now that WebExtensions are implemented in FF, it's incredibly easy to port extensions across (I ported one earlier this week; it took about 20 minutes!)

Tree style tabs and all that jazz aren't going away yet.

It's a killer move from Mozilla and makes FF a first class citizen of the web again.


It isn't just that Chrome extensions are easier to write than XUL FF extensions. It is also that XUL FF extensions are poorly maintained in various ways, with more gotchas. Additionally, FF's extension signing was largely a mess, and in less than a year we're transitioning from JPM to another build tool.

You can't even go from unlisted to listed on Firefox's addon site.


My question still stands though:

Will the new Firefox support real extensions or just wannabe-extensions like Chrome?


WebExtensions are a superset of what Chrome does. Things like tree-style tabs will be possible in Firefox with the new extension system, for instance.

There are already several WebExtension APIs implemented that Firefox supports that Chrome does not, the difference now is that APIs are being built specifically for extension authors to use, rather than Firefox internals being directly accessible (which certainly allows you to modify anything, but is intractable to maintain backwards compat and secure).

Many authors of popular extensions are working on specifying and prototyping the APIs they need.


The primary concerns I have, which I left Firefox over, are:

- Dropping XUL breaks backwards compatibility and it seems Mozilla is willing to break it before they provide adequate replacement APIs

- The design of the new addon signature requirements turns AMO into a walled garden and I very much do not appreciate that


> Dropping XUL breaks backwards compatibility and it seems Mozilla is willing to break it before they provide adequate replacement APIs

The problem is not XUL. Modifying HTML or XUL DOM via JS is roughly equivalent (setting aside XUL-only features like XBL, does this matter for many extensions?)

The problem is all the internal JS APIs that add-ons can call right now, there are too many to secure and ensure backwards compat. This is why extensions break between releases so much. It's also pretty hard to program against, so there are a lot of common bugs and it's very difficult to ensure any level of security, since Firefox extensions can do anything.

WebExtensions are intended to be a superset of the APIs Chrome exposes, new APIs are being added all the time. It must be possible to implement them securely and maintain them over time, unlike the current situation with internal-only APIs.

> The design of the new addon signature requirements turns AMO into a walled garden and I very much do not appreciate that

I disagree. Signing is required to make it more difficult for malicious extensions to persist in the wild. There's no requirement to host on AMO, just a requirement to sign if you want your extension to run in Firefox release builds.

If the extension is later found to be malicious, it can be revoked without having to depend on the ID (which is set by the add-on and trivial to circumvent).

> The primary concerns I have, which I left Firefox over

Which browser did you switch to? There's still time to participate and influence outcomes; old-style extensions are still supported today...


Technically you don't have to distribute through Mozilla, just submit for signing. While it's true they remain a gatekeeper, it's not as bad as the iOS App Store.


The docs @ Mozilla refer to the "WebExtensions" API as if it were a standard.

Googling gets me nothing though, except the Chrome API and Opera's copy thereof.

So Mozilla basically just said: "Let's just copy the Chrome APIs. Done"

Don't get me wrong, I've written some Chrome extensions, and I think the API is mostly pretty great. But wouldn't an actual standard be appropriate?


Standardization would be a useful outcome; having multiple implementations shipping is a prerequisite, as I understand it. It's something Mozilla could participate in, along with other interested browser vendors.

There's a community group: https://www.w3.org/community/browserext/ but no WG yet as far as I can tell.


Seems like a tool about to be used in the Mozilla Context Graph effort, discussed on HN last week [1].

[1] https://news.ycombinator.com/item?id=12044212


Is this the tech behind Firefox Reader View?


Not exactly, though it was inspired by (and is reacting to) that approach. Reader View is based on this: https://github.com/mozilla/readability


Which itself turns out to have been contributed by the community in 2010. Open source in action.


I don't quite understand why this is useful, but someone could turn it into a browser extension.




