How does Firefox's Reader View work? (2020) (videoinu.com)
244 points by pmoriarty on March 31, 2022 | 102 comments



>Getting author information a.k.a. the byline relies primarily on either the correct appearance of two attribute=value pairs, rel=author or itemprop=author, or a regexp soup matching potential bylines.

The heuristics are indeed brittle, which is why I had to add:

    <span class="work-around-firefox-reader-mode-bug" rel="author">&#8204;</span>
... to my website. Without that, Reader Mode would look for the first HTML element that contained the string "author", assume its content was the name of the author of the article, and make it a byline right below the title. Of course the element it picked was a section header, with the content "Completing an authorization". So not only did I end up with a blog post written by "Completing an authorization", but also the section that was supposed to have that header no longer had a header.

So the hack is to add another element before the website content that it picks instead. &#8204; is a ZWNJ, which I used because Reader Mode is too smart for its own good and will ignore an element that has empty text content or just an NBSP. So a ZWNJ is the next-best thing.
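For illustration, here's a hypothetical version of the kind of emptiness check involved (not Readability's actual code):

    // Hypothetical check, not Readability's actual code. \s in
    // JavaScript matches NBSP (U+00A0), so an NBSP-only element reads
    // as empty; ZWNJ (U+200C) is not whitespace, so it survives.
    const looksEmpty = (text) => text.replace(/\s/g, '').length === 0;

    looksEmpty('\u00A0'); // true  -> element skipped by the heuristic
    looksEmpty('\u200C'); // false -> element accepted, hack works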

But it's great when it works, and I'm glad it exists. I have JS disabled by default and have lost count of the number of websites that use CSS to hide their content and JS to make it visible. Reader Mode defeats them easily.


"&#8204; is a ZWNJ, which I used because Reader Mode is too smart for its own good and will ignore an element that has empty text content or just an NBSP."

ZWNJ and its other zero-width friends are my ninja secret. Oh, are you demanding this text box be filled with something? No thank you. Have a ZWNJ.

Few, if any, sites that are "checking" for content being in some widget "correctly" filter out the entire Unicode space class. (I mean, I assume somewhere out there there is one, but so far this has never failed me.) Most of them just filter out ASCII space 32. A few are smart enough to filter out tabs as well; if Reader is also filtering out NBSP, that already puts it at the head of the class.

(On the other hand, I would submit that by the time someone has taken the time to find one and copy & paste it into your text widget, you should consider your options carefully. The user has just demonstrated significant technical sophistication. It may not be your wisest move to also filter that out. You want some Zalgo emojis? Because I can give you Zalgo emojis. You may not want Zalgo emojis; there are still some systems out there that emojis can break. You want to find out?)
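For reference, a filter that would actually catch this has to go beyond whitespace. A minimal sketch, with the usual zero-width offenders listed by hand:

    // Minimal sketch, not from any particular site. \s in JavaScript
    // already matches Unicode whitespace including NBSP; zero-width
    // characters (ZWSP, ZWNJ, ZWJ, word joiner, BOM) are not
    // whitespace and must be listed explicitly.
    const invisible = /[\s\u200B\u200C\u200D\u2060\uFEFF]/gu;
    const isEffectivelyEmpty = (text) => text.replace(invisible, '').length === 0;

    isEffectivelyEmpty(' \u00A0 '); // true  (spaces and NBSP)
    isEffectivelyEmpty('\u200C');   // true  (the ZWNJ trick is caught)
    isEffectivelyEmpty('hello');    // false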


Why use a ZWNJ and not just, well, an actual author name? I'm probably missing something here.


Because it's my website and I get to decide whether to have an author name or not.

It's bad enough that browsers assume my website, that's made up entirely of free-flowing text inside p tags, won't fit on phone screens unless I add a meta viewport tag. I'm not going to also make changes to the content to satisfy them.


> It's bad enough that browsers assume my website, that's made up entirely of free-flowing text inside p tags, won't fit on phone screens unless I add a meta viewport tag.

This annoys me to no end as well. It's especially ridiculous now that mobile browsers provide a "Desktop" mode that can be used as a fallback for those sites that do need a larger page width.


"Without that, Reader Mode would look for the first HTML element that contained the string "author", assume its content was the name of the author of the article, and make it a byline right below the title. Of course the element it picked was a section header, with the content "Completing an authorization"."

Why would a section header in your article contain "author" if you didn't mean for it to designate who the author of the article was?


> Why would a section header in your article contain "author" if you didn't mean for it to designate who the author of the article was?

It's bad substring matching. It's "authorization", but apparently Firefox just interprets that as "author".


Oh, I see. Yeah, that's a bug in the design, because it's trying to infer metadata from unstructured content.

There's no reliable way to do it. The right way would be to only use explicitly defined metadata, and if it's not there, to default to showing nothing.
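A minimal sketch of that approach (hypothetical code, not what Reader Mode actually does):

    // Trust only explicit author markup; if it's absent, return null
    // and render no byline at all.
    function getByline(doc) {
      const el = doc.querySelector('[rel="author"], [itemprop="author"]');
      return el ? el.textContent.trim() : null;
    }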


Presumably there is a

    class="authorization"

or something like that in the enclosing element, so it's not that they're inferring metadata from unstructured content - it's that they're drawing bad inferences from the metadata provided.

That said, inferring metadata from unstructured content is literally the whole goal of Reader mode - to make pages more readable even if the original source didn't design them to be - so while this particular bug is avoidable, others may not be.


"inferring metadata from unstructured content is literally the whole goal of Reader mode"

Is that the whole goal, or is it to get around the design choices of the original?

These are two different things.


We can call it wrongly structured. It might be unstructured, or it might be badly structured due to design choices (including user hostile ones).


Interesting bug. I bet they'd be willing to accept a PR to tweak the logic that searches for "author"... or I hope they would. For example, by matching within word boundaries (\b), as sketched below.
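A minimal sketch of the difference (the actual Readability pattern may differ):

    // Substring match vs. word-boundary match against an attribute value.
    const substringMatch = /author|byline/i;       // current-style heuristic
    const boundaryMatch = /\b(author|byline)\b/i;  // the proposed fix

    substringMatch.test('Completing an authorization'); // true (false positive)
    boundaryMatch.test('Completing an authorization');  // false
    boundaryMatch.test('post-author');                  // true (still matches)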


> for instance, any node with class skyscraper is considered hidden (maybe the logic is that those elements are shrouded by clouds).

Ha, you wish. Skyscraper is the name of a standard Internet advertising unit. It’s fairly common for adblockers to block it. https://www.marketingterms.com/dictionary/skyscraper_ad/


Is it still being used?


Yes, it's a standard industry term and still used on many sites.


It's really nice that browsers offer reader modes, but they are frustratingly incomplete.

Really, let's switch to a user perspective for once and consider: what if I always want reader mode? This is technologically completely impossible, and all the solutions are a band-aid.

Firefox and others' attempts rely on the page authors' goodwill. But some pages will always attempt to frustrate reader modes.

Alternative approaches to content extraction use machine learning, such as [1], but those of course need to be updated for culture-, language-, and technology-specific changes.

It's a mess and will remain so for the foreseeable future.

[1] https://github.com/dragnet-org/dragnet


It's a mess by design. In an ideal world no browser would need a reader mode, because pages themselves would be reader-friendly by design. Page authors, of course, are instead more interested in optimizing for advertising and tracking. Reader mode, much like adblock extensions, works directly against their interests.

In fact we wouldn't even need any new web standards to implement something like this. Any website following the latest Web Content Accessibility Guidelines and using the correct page structure and ARIA attributes should be fully semantically parsable and displayable in any format or medium.


There was a lot of stuff/buzz around the semantic web in the early 2000s.

I’ve been out of web development for a long time, but it seems to have mostly died with the advent of React and the single-page web application. Now it looks like everyone apart from public government sites with strict accessibility requirements just doesn't bother putting the effort in.


I think SEO is a large reason why popular sites do bother with semantic tagging.


True, but at the same time there's an incentive against making semantics too good. The separation of data from presentation competes with the need/desire to monetize.

For example, somebody built an app that scrapes free recipes from the web, so that they're shown without ads and the ridiculous fluffy lectures that are typically part of them.

Great for users of that app, not so great for the recipe authors depending on this model.

This is also why RSS is dead. Both Twitter and Facebook had it before, but killed it. You don't want to give away your goods for free.

It's easy to pin this problem on the recipe bloggers or Facebook, but we're just as responsible for this. We don't want to pay for anything, so it's ads. Then we avoid and block ads.

We lack a payment culture.


This is one of the problems, and it might be the biggest problem with the modern web, but it's not the only problem. Non-experts are bad at producing remixable content.

* A significant number of people do not really use file names. https://jayfax.neocities.org/mediocrity/gnome-has-no-thumbna...

* Twitter offers no official way to make bold or italic text, but some people make it work using math symbols. They either don't know, or don't care, about all the accessibility and search tooling this breaks.

* Try downloading a C program off the internet and compiling it on anything other than the original developer's computer.

In none of these cases does anybody benefit from the lack of usable metadata, yet nonetheless the metadata doesn't exist.


You're absolutely right. The vast majority of web content goes undiscovered because it lacks any and all metadata.


Unsolicited advertising should be made illegal.

Then there'd be no reason to optimize the web to serve up ads, because legally you couldn't.

The world needs to switch away from advertising for the good of humanity.


Hey, fine with me. But do come up with an alternative monetization strategy.


Paywalls, donations, selling a service, Kickstarters, having something physical to sell, etc.


These are (sometimes) good solutions for highly specific high value offerings, but that still leaves some 95% of the rest of the internet to cover ;)


I'm sure they can live without my views. If I can't read the page I can always press the back button. I'm not going to buy anything from any ads anyway so no loss on either side except a few seconds of time lost for me.

Just got a link to documentation for an API that I couldn't read because of too-low contrast. The Dark Reader plugin nicely fixed it, so I didn't even have to try reader mode. It sucks to get old anyway; you young people don't have to make it worse. You will be in the same seat in 20-odd years, so plan for it :-)


"I'm not going to buy anything from any ads anyway"

The purpose of ads is often to get you to just recognize the brand (and/or associate positive feelings with it) so that if you are ever in the market for their products you'll be more likely to choose their brand over a "no-name brand" (i.e. one that you haven't seen ads for).

So you might think you're immune to advertising, but you probably aren't.


Yes, that's why I use adblockers everywhere, but that sometimes breaks the page, and then I press the back button.


I offset these impacts by actively avoiding brands I recognize from unsolicited ads, and I encourage everyone to do the same.


"In an ideal world no browser would need a reader mode, because pages themselves would be reader-friendly by design."

This would happen if page authors just participated fully in the Semantic Web.

https://en.wikipedia.org/wiki/Semantic_Web


"sometimes known as Web 3.0"

:(

The future we could have had, but did not get


But now we're getting Web3 instead, with blackjack and blockchains.


I have a rudimentary always on reader mode: I turn off CSS and JS by default. Once I scroll past what would be the billion custom button gifs and menu items I'm left with beautiful, clean text.

I turn the stuff on on a per-site basis when I want to use an actual web app, something like GitHub. Every now and then I get an article that wants to fight me on it, and my usual response is "well, I guess you don't want to influence me with your words bad enough" and I back out. But it works 99% of the time exactly how I expect it to.


Me too. If I could change one thing about modern web dev, I would make the default size for SVG icons something like 32x32 pixels. Many sites use massive icons more like 1000 pixels wide. Also, animated grey backgrounds in placeholders are obnoxious.


Another approach to completeness could be to remove noise from the original page instead of parsing just the text from it. In the worst case the page isn't changed at all but it's still usable (like when ad blockers miss some ads). [0]

I assume you only want always-on reader mode for articles -- and detecting what's an article is another NLP problem. Yet both the completeness and article detection can probably be solved through heuristics in 90% of cases (the evidence is that we DO use reader modes). Maybe it depends on how much the last 10% frustrate you.

[0] I'm working on a browser extension that does this: https://github.com/lindylearn/unclutter


A reader mode's job isn't to be 'complete' in some sense. You have addons to do anything you want to pages, or you can just run your own js on any given page.

Reader mode is a reasonable low-overhead script-stripper that mostly does its job. My expectations aren't high, but it always delivers.


"You have addons to do anything you want to pages"

Except not on Firefox Mobile. This has been the most frustrating regression I've experienced, suddenly losing my favorite things about Firefox.


And there's no reliable way on Firefox Mobile to force reader view, either. This was possible on earlier versions with the about:reader?url= hack but not any longer.


> what if I always want reader mode? This is technologically completely impossible, and all the solutions are a band-aid.

Sigh… welcome to Opera versions 9 to 12. Not only did this complete technological impossibility exist, it did so over a decade ago.

Tools → Preferences… (ctrl+f12) → Advanced → Content → Style Options… → Presentation Modes → Default mode [User mode]


The problem is that running in this mode can cause some web pages to break by default. Firefox doesn't even show the reader mode button unless its heuristics think it will work. Sometimes, when I put something in Pocket, it gives me a one-sentence abstract, but hides the entire rest of the article.

Why do you think Opera got rid of it?


> Why do you think Opera got rid of it?

http://enwp.org/Opera_(web_browser)#History

Opera A.S. threw away the whole browser, with its countless innovations and usability affordances. The next major version after 12 was a Chromium derivative.


This was one of the sadder things to happen to internet technology. Opera used to be miles ahead of the other browsers, and they threw it all away.

Yet another example of "worse is better".


One thing that I saw on HN a while ago was a bookmarklet to remove all the junk from the page... quite useful, and I typically use it daily. It basically just removes any sticky- or fixed-positioned junk. Surprisingly effective.

    javascript:( function(){ let i, elements = document.querySelectorAll('body *'); for (i = 0; i < elements.length; i++) { if(getComputedStyle(elements[i]).position === 'fixed' || getComputedStyle(elements[i]).position === 'sticky') { elements[i].parentNode.removeChild(elements[i]); } } document.body.style.overflow = "auto"; document.body.style.position = "static"; })()

That's the bookmarklet; I can gist it if someone wants, but either way it works.
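Expanded for readability, the same logic looks like this (join it back onto one line, keeping the javascript: prefix, to use it as a bookmark):

    // Remove every fixed- or sticky-positioned element on the page,
    // then restore normal scrolling in case an overlay had locked it.
    (function () {
      const elements = document.querySelectorAll('body *');
      for (let i = 0; i < elements.length; i++) {
        const position = getComputedStyle(elements[i]).position;
        if (position === 'fixed' || position === 'sticky') {
          elements[i].parentNode.removeChild(elements[i]);
        }
      }
      document.body.style.overflow = 'auto';
      document.body.style.position = 'static';
    })();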


I specifically wrote always-available, "sticky" (default: stays in Reader View even if you go to another page) and automatic (goes auto on subpages or entire domains, as you specify) Reader View support built in to TenFourFox to help out on slower systems.


I guess AMP was the closest we'll get to an always-on reader mode. Not sure how popular it is, but at my workplace there are plans to deprecate our AMP support. I'd really like a desktop browser to take advantage of it.


> But some pages will always attempt to frustrate reader modes.

So STOP READING THOSE PAGES!!!

The reason the pages are full of ads is that people keep reading them.


I don't read the internet just to enjoy the wonders of efficient, plain web design.

I'm not gonna stop reading the newspapers or blogs I consider most trustworthy (or less bad) just because I don't like their design choices!


The opposite could be true: the fewer people reading, the more additional adverts needed to maintain the page.


I agree, but I'm a researcher, and while I don't read those pages I still need to get the content out in some cases [1].

[1] Meaning I've spent in the ballpark of five figures in power bills extracting content from HTML. Not my personal money, of course.


If only we had some descriptive language to mark up the actual, readable content of a web page in the simplest form and then have all the auxiliary content, table-of-contents, dynamic functionality, and styling described separately so that the browser would know what content is essential and what is optional...


If you're interested in playing with Mozilla Readability.js from the command line I came up with this recipe using my shot-scraper tool the other day:

https://til.simonwillison.net/shot-scraper/readability

    pip install shot-scraper
    shot-scraper install
    shot-scraper javascript https://simonwillison.net/2022/Mar/24/datasette-061/ "
    async () => {
      const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
      return (new readability.Readability(document)).parse();
    }"


shot-scraper is terrific, thank you!

Right now I'm using a bash script instead of a YAML file for screenshotting multiple sites I maintain, as there is no option to add a timestamp to the filenames. Something like this (simplified):

  declare -A capture
  capture[www.foo.com]=https://www.foo.com/
  capture[www.foo.com-bar]=https://www.foo.com/bar

  for key in "${!capture[@]}" ; do
    shot-scraper ${capture[$key]} -o - > ${key}_$(date +"%Y-%m-%d_%H_%M")
  done

Other than that, it is a terrific tool, able to screenshot websites that are otherwise very difficult due to CDNs, JavaScript, etc. Thank you very much!


Thinking about this and the Gemini protocol, it seems to me that HTML has become too user-hostile, or at least it gives too much rope for site developers to hang users.

The way I'd solve this problem would be to develop a new file format that was similar to Markdown but contained semantic elements (eg "this image is relevant to this bit of text", or "this text explains that text").

This would be rendered into HTML by a browser extension so the browser could display it, which would help keep authors honest (because if the server renders it, we're just using HTML again).

The format would contain no styling information at all, apart from the semantic tags, so styling would be entirely on the client. You could pick the way you wanted your articles to look, which would probably be different per device.

This sounds a bit like epub and would probably be great for book reader devices as well, while being extremely light to load because there isn't much you can do with it.


Actually, HTML hasn't changed much for many years now. Which is kind of the problem - everything around it (mostly CSS, and also JS) had to change ad absurdum to exercise HTML into a scene graph for web apps.

It's not like W3C and others hadn't tried to renovate HTML - that's what W3C's XML initiative was about in 1998, with a generic namespacing mechanism in anticipation of a wealth of new vocabularies (of which SVG and MathML made it, but XHTML did not).

I'd argue W3C people were so wound up in XML and "enterprise" tech for like ten years that the web stack was unprepared when the iPhone with requirements for mobile sites came around, and W3C became an easy target to take HTML away from.

Mentioning this to warn against inventing new meta "formats" all the time - everything needed for HTML was already in existence before 1986 when SGML (on which HTML is based) was published.

Including markdown and custom shortform syntax. Large parts of markdown can be specified using SGML shortref, yielding a shortform syntax that expands to HTML proper, capturing exactly the way markdown was originally specified.


Unfortunately, it seems like a technical solution to a society-wide human problem to me.


Tech people love throwing technology at social problems.

Plain HTML is great for documents. It's just the browsers' failure to render it nicely that makes authors feel it's necessary to add styling to make it presentable.


"The way I'd solve this problem would be to develop a new file format that was similar to Markdown but contained semantic elements (eg "this image is relevant to this bit of text", or "this text explains that text"."

It sounds like you're reinventing the Semantic Web:

https://en.wikipedia.org/wiki/Semantic_Web


I think somebody came up with something similar to this called Aftertext(1). I remember it as I found the concept kinda interesting.

Unfortunately, the easy part is designing the protocol; the hard part is universal rendering and adoption.

(1) https://news.ycombinator.com/item?id=29643313


Hmm, that's an interesting idea, though I think the format looks too cumbersome. Still, that's mostly what I meant, thanks for the link!


The look of HTML pages used to be up to the end user, W3C should never have allowed non-logical HTML and other ways of breaking design. When a new standard is invented as a remedy, it will suffer the same fate as HTML as long as this problem is not taken care of. AFAIK, the way to take care of it is to enforce semantic markup in all aspects of a page and leave the rest to the end user. The browser should know things like "this is a large list of strings that the user can choose", "this is an image with the following description", maybe even author-intended display preferences, but not more than that.

I think the overall best solution would be to switch from a layout language to a programming language where each program manages special input and output variables that are marked semantically and given display and layout hints. No HTML, no CSS, just programs on a virtual "semantic display" with input and output of data to the user.


I found readability pretty useful for building a plain text version of articles. From there, I'd love to take an RSS feed and transform it into an ebook.

https://earthly-tools.com/text-mode

https://news.ycombinator.com/item?id=30829568


I'm surprised that more effort hasn't gone into hacking on "line mode" (terminal-based) web browsers, especially for reading developer documentation, and that the effect this should have had on hackers' own sites is nowhere to be seen.

I find most people who consider their projects to be "no-frills" have quite a few frills indeed. This includes personal blogs, but Sourcehut and the constant praise for its "clean" UI is one other example of this sort of thing. Load it up in w3m or Lynx and it very much feels like a site intended to be read in a conventional desktop browser but filtered and constrained to a text-only medium.

Actually clean design would start with, "How would I expect this to feel if I made a custom, terminal-based (but not command-line driven) Sourcehut client?" and then figuring out the appropriate markup you'd need to spit out to achieve that in a line mode browser that understands the basics—forms, content ordering, etc.

(PS: Now transport that same artifact to a conventional browser. Is the experience better or worse than what's currently available?)


Evernote (remember them?) had a pretty good tool called Clipper that would save a reader-version of the page to Evernote. It was pretty great, but I haven't used it lately.


Ha, I just worked through this a week ago: https://searchfox.org/mozilla-release/source/toolkit/compone...


I implemented a variation of the Readability algorithm some 9 years ago, in case anyone needs a server-side Python version and is interested in dragging it (kicking and screaming) into the 2020s:

https://github.com/rcarmo/soup-strainer


Mozilla’s readability library is also usable on the server with Node.js and JSDOM.

That’s if you are fine with using JavaScript on the backend.
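A minimal sketch of that setup (assuming the jsdom and @mozilla/readability npm packages):

    // Fetch a page, build a DOM with JSDOM, and run Readability on it.
    const { JSDOM } = require('jsdom');
    const { Readability } = require('@mozilla/readability');

    JSDOM.fromURL('https://example.com/article').then((dom) => {
      const article = new Readability(dom.window.document).parse();
      if (article) {                 // parse() returns null on failure
        console.log(article.title);
        console.log(article.textContent);
      }
    });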


My Hacker News client HACK for iOS and Android has a built-in reader mode browser. It can be set as the default for Hacker News sites in the app, too. While on iOS I was able to use the reader mode feature provided by SFSafariViewController, that wasn't available on Android.

So I had to read a ton about this. I ended up using a heavily modified Kotlin version of Readability:

https://github.com/dankito/Readability4J

https://play.google.com/store/apps/details?id=com.pranapps.h...

https://apps.apple.com/us/app/id1464477788


Thanks for Hack, it's what I'm currently using!


Readability is amazing. I find it works most of the time.

Currently I'm using it to parse a web article, package it into a .mobi file, and send it directly to my Kindle Paperwhite.

If you want to read your favorite content on Kindle, give it a try[0]

[0]: https://ktool.io


For websites, do you have examples where it outperforms Amazon's own solution? https://www.amazon.com/sendtokindle/


KTool[0] is fairly new, so I can't claim it "outperforms" Amazon's solution. However, if you read the recent reviews on their Chrome extension page[1], the majority of complaints were that the article never arrived.

I just tried sending this article[2] and 10 minutes later it still hadn't been delivered. This article contains a lot of images, so I understand it's not going to be quick, but KTool delivers it within 2 minutes.

But that is not the only reason why I built KTool. I don't want the articles to include inline links when reading on my Kindle, because they're bad UX (easy to mis-tap), distracting, and reduce comprehension (see here[3] and here[4]). Instead, I push all these links to the bottom of the article, which keeps them accessible yet improves the reading experience (see the sketch after the links).

[0]: https://ktool.io

[1]: https://chrome.google.com/webstore/detail/send-to-kindle-for...

[2]: https://waitbutwhy.com/2017/04/neuralink.html

[3]: https://www.wired.com/2010/05/ff-nicholas-carr/

[4]: https://www.roughtype.com/?p=1378
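For the curious, a hypothetical sketch of that link-demotion idea (not KTool's actual code):

    // Replace each inline link with its text plus a footnote marker,
    // then append the collected URLs as a numbered list at the end.
    function demoteLinks(articleEl) {
      const urls = [];
      articleEl.querySelectorAll('a[href]').forEach((a) => {
        urls.push(a.href);
        a.replaceWith(`${a.textContent}[${urls.length}]`);
      });
      const list = document.createElement('ol');
      for (const url of urls) {
        const li = document.createElement('li');
        li.textContent = url;
        list.appendChild(li);
      }
      articleEl.appendChild(list);
    }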


Update: 5 hours later, no sign of the article being delivered. I think it's safe to assume the Amazon solution doesn't work ¯\_(ツ)_/¯


This is amazing! Do you have any ideas on how to use it to send content behind paywalls to the Kindle?


Well, I won't implement such a feature since I don't want to be part of a legal dispute.

But I'm working on a feature where you can send content from the client side. Meaning you can use any other extension to modify that page's content[0]. The KTool extension will take that content as input, parse it, and then send it to your Kindle.

[0]: https://chrome.google.com/webstore/detail/bypass-paywall/kko...


I wasn't actually suggesting bypassing paywalls (in case it came across like that) but rather taking content that a user has access to but that is not accessible by a third party - such as paywalled content, content behind login screens, etc. An extension seems like a natural way to accomplish that.


Ah gotcha. Yes, I'm already working on a browser extension for that. That's also the main differentiator to my competitors.

Hey you can sign up for an account to receive updates, or just reach me directly via email. My email is on my profile. Thx


Wondering if this (plus Safari's reader mode heuristics if those are any different) could form the basis of a much-needed reduced HTML subset. Like, for the HTML we're actually searching, as opposed to the trackfest and seo'd articles search engines are giving us.

Edit: so much for the illusion of "semantic HTML", i.e. where you need heuristics and are entering an arms race vs. publishers just to make your HTML readable


Does anyone know how difficult it would be to develop a reader view only browser?


Well, there's a few options out there for something like this, it depends on what you're going for specifically.

There's Gemini and gopher, protocols designed around transferring and rendering plain documents.

There are terminal based browsers like Lynx that render pages in a terminal, of course it's all text on your end.

If you're talking about a GUI program to do what reader view does and only that, the reader view code is open source and available from Mozilla; I'm sure it wouldn't be much to build a webview app for mobile, or a simple GUI that sits on top of curl or wget or something like that, fetches a URL, processes the page, and renders the text. You're probably going to be manually entering URLs, though; I'm not sure how, on the modern web, you'd even click from one site to another, or wind up at the URL of a written article, using only reader view.


I think the problem here is not on the technical side. The places where we find stuff to read (aggregators, homepages, social networks) just don't work well if you're only extracting the text and links.

There's also a huge long tail of parsing issues because web pages are not static documents. You'd want a fallback to the original HTML -- so why not use your main browser? There are extensions that activate reader mode automatically on pages where it's supported [0].

[0] I'm working on one of them: https://github.com/lindylearn/unclutter


There are a couple of (not very popular) plugins that can set Reader View as the default. One of them is Automatic Reader View[1], which has an option to show all pages in reader view.

1. https://addons.mozilla.org/en-US/firefox/addon/automatic-rea...


You don't need to. All you'd have to do is hack up one of the reader view extensions so that it's triggered when loading a page rather than by clicking a button.


I guess OP eyes a reader-mode-only browser as an achievable milestone on the way to developing a browser (something sorely missing, since ever-expanding "web standards" only serve to pull up the ladder), rather than as an app from a user PoV.


This, I could've been more specific. I basically just want to design a simple HTML 4 (or doable parts of HTML 5) layout engine to render static documents and avoid any dynamic components.



Try about:reader?url=YOUR_URL_HERE. If that does not work, it means the URL does not support it.


Sadly, no longer works on Firefox Android. I used to use this trick a lot.


This seems to be a prime example of the Pareto Principle, or the 80/20 rule. About 12 years ago, I built something similar and published how I did it [1]. It was a very simple process, which my younger self called "the meat algorithm", as in, how to get the meat of an article.

It was far less code and worked perfectly 95% of the time (though, the average web-page was also a little simpler 12 years ago). But that code would have quickly ballooned out if our use-case had called for addressing the other 5% of webpages, as the Firefox Reader View must do.

[1] https://www.alfajango.com/blog/create-a-printable-format-for...


I haven’t directly compared them, but I have also found mercury parser (https://github.com/postlight/mercury-parser) to be pretty reliable. Its advantage is that you can directly pass it a page instead of having to give it a DOM.

Since it turns a website into very plain (X)HTML, it's fairly easy to use it to make a browsing proxy or automatically produce epub files for e-readers, which is what I do.

Edit: Here’s the proof of concept type code I use: https://gist.github.com/solarkraft/d6306f17a761fcb5ce47f2be7...

It’s a bit crappy, but it works for me :-)
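In essence, a minimal sketch of the core call (assuming the @postlight/mercury-parser npm package):

    // Pass Mercury a URL directly; it fetches and cleans the page itself.
    const Mercury = require('@postlight/mercury-parser');

    Mercury.parse('https://example.com/article').then((result) => {
      console.log(result.title);
      console.log(result.content); // plain XHTML, ready for an epub pipeline
    });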


I use Safari and have it set to use reader view by default. Much more pleasant!


For those wondering if there's a readability lib in their favorite language, here's a list of them all (as far as I know), plus the original Arc90 implementation:

https://github.com/masukomi/arc90-readability/#readability

Please submit a PR if there's something I don't have listed there.


I used Firefox for 19 years without interruption until last summer, when I switched off of it. Its reader view was one of its best features; it works on more pages than competing browsers' reading modes. While they removed most of the features that I had grown to love over the past two decades, another standout for me was the ability to have multiple picture-in-picture instances running.


Can this engine be used to build something like Pocket? Does Pocket use the same tech to fetch the content of pages?


I'm always amazed at how Firefox's reader view circumvents so many javascript paywalls.

I'm surprised at how incredibly mono-cultured web developers have become in testing browsers other than Chrome.

OR .... they know, and they don't care ....


Often they don't care because this makes it easy for Google to still index their content and blocks most humans.


Thank you for this! I've always wondered where this code lives because I love reader view and I'd like to use it as a base style for my blog.


To HN users reading this comment, where did you find the best reader view implementation?

Seems like it's a really difficult task.



I think it won't work on TechCrunch, since it somehow blurs the page.

But generally it's awesome.


I've been working on several web extractor projects, so I think I can share some of my findings from working on them. Granted, it's been several months since I worked on this, so I might be forgetting some things.

There are several open source projects for extracting web content. However, there are three extractors that I've worked with that give good results:

- readability.js[1], the web extractor by Mozilla that is used in Firefox.

- dom-distiller[2], the web extractor by the Chromium team, written in Java.

- trafilatura[3], a Python package by Adrien Barbaresi from BBAW[4].

First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want, either in a web page using a `script` tag or in a Node project.

Next, DomDistiller is the extractor used in Chromium. It's written in Java, with a whopping 14,000+ lines of code, and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.

Finally, Trafilatura is a Python package released under the GPLv3 license. Created in order to build text databases[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it came to work really well with other languages. It's a bit slow compared to Readability.js, though.

All three work in a similar way: extract metadata, remove unneeded content, and finally return the cleaned-up content. Their differences (that I remember) are:

- In Readability, they insist on making no special rules for any website, while DomDistiller and Trafilatura make a small exception for popular sites like Wikipedia. Thanks to this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.

- Readability has a small function to detect whether a web page can be converted to reader mode (see the sketch after this list). While it's not really accurate, it's quite convenient to have.

- In DomDistiller, the metadata extraction is more thorough than in the others. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.

- Since DomDistiller is only usable within Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it will be deemed unimportant. However, according to research[6], this step doesn't really affect the extraction result.

- DomDistiller also has an experimental feature to find and extract the next page on sites that split their articles into several partial pages.

- For Trafilatura, since it was created for collecting a web corpus, its main ability is extracting the text and publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose only purpose is to extract the publication or modification date of a web page.

- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
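As for the reader-mode detection function mentioned in the list above, a minimal sketch of its use (the npm package exports it as isProbablyReaderable):

    // Cheap pre-check before running the full parser. Readability
    // mutates the document it's given, so parse a clone. `document`
    // here assumes a browser (or JSDOM) context.
    const { isProbablyReaderable, Readability } = require('@mozilla/readability');

    if (isProbablyReaderable(document)) {
      const article = new Readability(document.cloneNode(true)).parse();
      // e.g. show the reader-mode button and render article.content
    }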

I've found a benchmark[8] that compares the performance of the extractors, and it says that Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it is intended for gathering a web corpus, so it's really great for extracting text content, but IIRC it's not as good as Readability.js and DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature though).

By the way, if you are using Go and need to use a web extractor, I already ported all three of them to Go[9][10][11] including their dependencies[12][13], so have fun with it.

[1]: https://github.com/mozilla/readability

[2]: https://github.com/chromium/dom-distiller

[3]: https://github.com/adbar/trafilatura

[4]: https://www.bbaw.de/en/

[5]: https://www.dwds.de/d/k-web

[6]: https://arxiv.org/abs/1811.03661

[7]: https://github.com/adbar/htmldate

[8]: https://github.com/scrapinghub/article-extraction-benchmark

[9]: https://github.com/go-shiori/go-readability

[10]: https://github.com/markusmobius/go-domdistiller

[11]: https://github.com/markusmobius/go-trafilatura

[12]: https://github.com/markusmobius/go-htmldate

[13]: https://github.com/markusmobius/go-dateparser


I use this in a proxy to be able to browse the web on my Amiga 1200




