TeX line breaking algorithm in JavaScript (and HTML5 Canvas) (bramstein.com)
86 points by bramstein on Feb 18, 2010 | 56 comments



Why are current browsers not using this, or some improved version of this algorithm?


Because a) first fit, best fit, or something similar is trivial to implement, while a good total-fit implementation that works in a lot of circumstances is not (I have re-implemented a simpler version of the algorithm in Ruby, including using a hyphenation library). And b) you can ask the same question: why does almost no word processor implement this algorithm? With word processors it is much more important! The answer is simple: most word processors do many things badly, so having a total-fit h&j (hyphenation and justification) algorithm would only make the output slightly better, but still much inferior to good book typography.
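
For contrast, first fit really is trivial: greedily pack words onto a line until the next word no longer fits, then start a new line. A minimal JavaScript sketch (measure is assumed to return the rendered width of a string):

    // First-fit line breaking: fill each line greedily, with no
    // global optimization over the paragraph.
    function firstFit(words, lineWidth, measure) {
        var lines = [];
        var line = [];
        words.forEach(function (word) {
            var candidate = line.concat(word).join(' ');
            if (line.length > 0 && measure(candidate) > lineWidth) {
                lines.push(line.join(' ')); // current line is full
                line = [];
            }
            line.push(word);
        });
        if (line.length > 0) lines.push(line.join(' '));
        return lines;
    }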


> why does almost no word processor implement this algorithm?

In TeX the line breaking depends on the whole paragraph. I imagine it could be visually quite distracting if the line-wrapping jumped around in a WYSIWYG word processor as you added text to a paragraph.

Edit: ah. I see this is noted in the posted article’s To-do section:

Figure out how to deal with dynamic paragraphs (i.e. paragraphs being edited) as their ratios will change during editing and thus visibly move around.


Try editing paragraphs in InDesign. In practice, it’s really not much of an issue, because paragraphs don’t actually jump around all that much, and to the extent that they do, it’s really not particularly distracting.


I tried it in Photoshop which since CS2 (or so) has an implementation of the Knuth & Plass algorithm. I found it quite distracting myself. Granted, I was on the lookout for text jumping around and had hyphenation disabled.


What about a menu (ribbon) item called "apply very nice typography to your document" that recalculates and reformats the paragraphs? Would this be a solution?


This is exactly why I prefer authoring documents in LaTeX over WYSIWYG tools: I can concentrate on semantics and not worry about presentation. Writing in Word forces me to be distracted by its appearance on the page and drives me crazy with micro-adjustments. With LaTeX, I just write the document and the computer does what computers are good at doing.


> I just write the document and the computer does what computers are good at doing.

Which is: interpret your instructions very literally, so you still need to go over your document carefully to fix problems like URLs running off the edge of the page?

Smart-arse joke aside, I agree with you that LaTeX is generally great, but it's not perfect, and sometimes it is difficult to make it do what you want.


Maybe. But there are probably all sorts of annoying details to deal with. For example, the pagination of the document could change, and so that would lead to further editing to fix it. Then how does line-breaking work during that editing phase, or in general after applying the menu item?

I don’t actually use a WYSIWYG editor any more, so I’ll leave my speculation there.


> why does almost no word processor implement this algorithm?

Because word processor implementors are lazy and incompetent? ;-)


Yes, I think that is the reason. Or perhaps: the users don't raise their voices to have something better?


> Or perhaps: the users don't raise their voices to have something better?

I think that probably most people don't realize there is something better, and can't even really tell the difference between typical word processor output and nice TeX-style optimal line breaks. It seems to me most people don't care that much about typography.


Alas, typography is one of those fields where most people only ever notice when you get it wrong.

A document presented with good typography might be easier to read for lengthy periods without losing concentration. There might be fewer distractions, like rivers of space running through the text, hard-to-read shapes due to poor kerning, or chunky CAPITALS and long numbers where small caps and old-style figures would not have disrupted the flow. There might be subtle visual cues to help the reader understand the material more quickly, like moving captions and headings closer to the subject material or spacing out bullet lists a little so their items are clearly separated.

But in a world where most people using word processors don't know what a stylesheet is, and emphasis tends to consist of centring text, setting it in bold capitals with double underlining, and then hitting enter a few times on either side to space it out a bit, I think we can safely assume that most people either don't know about the subtleties of good typography or just don't care. The world might be a slightly better place if leading word processors all adopted better typographical conventions by default, and a few of us would appreciate the ability to produce better quality results, but I'm afraid it's never going to be a selling point for most people.


Word processors may be a lost cause, but we're in the middle of an e-book war (apparently); surely someone is going to display standard EPUB books better than anyone else and steal that all-important typography-snob demographic?


The claimed reasons are (a) it would take work to implement, and (b) it would be slower than the current method.

Both seem like pretty weak excuses to me.

The main real reason is likely that most users don’t care, and so they can get away with it.


(b) is not correct. The time complexity of the total-fit algorithm (as described in Knuth & Plass's paper) is almost the same as the simple (first-/best-fit) method's. They use dynamic programming to divide the problem into pieces, exploiting the fact that an optimal set of breaks for an initial part of the paragraph can be extended to an optimal breaking of the whole paragraph.
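
Roughly like this (a much-simplified JavaScript sketch: it drops the box/glue/penalty model and hyphenation, and just minimizes the squared leftover space per line; widths is assumed to hold the measured word widths):

    // best[i] = minimal total cost of breaking words 0..i-1 into lines;
    // an optimal solution for a prefix extends to the whole paragraph.
    function optimalBreaks(widths, spaceWidth, lineWidth) {
        var n = widths.length;
        var best = [0], breaks = [0];
        for (var i = 1; i <= n; i += 1) {
            best[i] = Infinity;
            var width = 0;
            // consider a line holding words j..i
            for (var j = i; j >= 1; j -= 1) {
                width += widths[j - 1] + (j < i ? spaceWidth : 0);
                if (width > lineWidth) break; // words j..i no longer fit
                var slack = lineWidth - width;
                // the last line may be short for free, as in TeX
                var cost = best[j - 1] + (i === n ? 0 : slack * slack);
                if (cost < best[i]) {
                    best[i] = cost;
                    breaks[i] = j - 1; // break before word j
                }
            }
        }
        return { totalCost: best[n], breaks: breaks };
    }

The inner loop stops as soon as the line overflows, which gives the O(kn) behaviour discussed further down for k words per line.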


It is definitely slower. The computational complexity of the Knuth-Plass algorithm might be of a similar (or maybe even the same) order as just fitting as much as possible on each line, but it at least has a different constant out front. I'm not exactly sure what the speed difference is, but it's non-negligible... though on modern hardware either way is absurdly fast, and there's no reason the Knuth method couldn't be done in real time even for quite large documents.


If you don't know the constant, how can you say that it is non-negligible? I doubt that one can measure the difference in speed rendering an average webpage with either algorithm.

And your remark about real time: even the old (and slow!) TeX can do the line breaking for several hundred pages of text in a second, including writing the PDF. So you are probably right about real-time h&j even for quite large documents.


Okay, according to this paper

http://delivery.acm.org/10.1145/1610000/1600217/p99-hurst.pd...

Knuth-Plass takes O(kn) where n is the number of words and k the number of words per line. That’s definitely slower than the naive line-by-line method.

But the better algorithm can supposedly be improved with the method in this paper (I haven't read it and don't have time just now, so I'm not sure how much slower it would then be than the naive way).

http://delivery.acm.org/10.1145/150000/146656/p546-eppstein....


There is also a paper that claims linear time for formatting paragraphs using dynamic programming: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.3...

I haven't really wrapped my head around the paper yet, but I think it only holds if you take out some of the elements of the original Knuth & Plass algorithm (the box, glue and penalty model). Nevertheless, linear time seems possible for a total-fit algorithm if you're willing to give up that model.

There is also a Haskell and C++ implementation (with source code) available on the author's web page: http://progtools.comlab.ox.ac.uk/members/oege/publications/s...


Nobody denied that it is slower in theory, but I still claim that you won't be able to measure the difference on an average web page. So time is not an excuse for not implementing this (and this was the original question).


Yeah, I never would have disagreed with that. Which is why I said “Both seem like pretty weak excuses to me.”

By “non-negligible” I meant something like 2–10 times slower (you'd have to test to be sure... I tried searching around but couldn't find any direct speed comparisons). For typical pages it's not going to matter, but if you have a page like the HTML5 spec, where it takes half a second to resize a window now, it's going to be noticeably slower using a more sophisticated paragraph composer.

Keep in mind, Safari doesn’t even do kerning, because they’re afraid it’d be too slow. (Also a ludicrous excuse for desktop hardware and typical web pages, IMO.)

And the Safari guys also use the same excuse (“it’s too slow”) for not doing real color management with CSS/HTML colors.


Before we make fun of the webkit folks for not doing the proper work because "it's too slow," remember that these are the same folks that we praise for making the fastest dang HTML rendering engine on the planet. (Or near enough.) Their culture of performance has significantly raised the standard of web browsing performance, and that's not something to take lightly.

I imagine it would have to make a fairly significant difference in rendering quality in order to make any speed losses worthwhile. Death by a thousand cuts and all that.


I agree with you now :-)

I would like to have two browsers of the same type, one with high(er) quality typography and one as it is now. And then compare the speed. That would be interesting.


Would it be a problem to send me these papers?

My email is in my profile.

edit: I probably have access from an academic network, but I am at home for a few days...



Thanks!


Can’t you just click the links?


I do not have access to ACM papers from my current location, and I will be here for a few days.


Oh, sorry. I think maybe these are ones that have full text available if you arrive via a Google search? I don't think I'm currently logged into anything special.


Funny thing!

When I clicked again just now, I was able to download this paper... I still don't understand what has changed.

Thank you for your time.

EDIT: And thanks to cpach!


For future reference, do

    ssh -D 1080 unixhost.yourinstitution.edu
then set your browser’s SOCKS proxy to localhost:1080.


Speed is a pretty important reason, and users not caring is also a pretty important reason. Users do care about speed, and don't care about absolutely ideal line breaking; so why trade off something they do care about for something they don't?


Well, it was an important reason in 1990, maybe, or even 1995. But we've had a couple of orders of magnitude improvement in computing speed since then. There's no reason to prefer some marginal speed advantage (tenths or hundredths of a second) over better layout.


Tenths of a second can be pretty damn significant. Every tenth of a second extra load time leads to approximately a 1% drop in sales, according to research done by Amazon.

One of the big points of competition between browsers these days is in speed. Part of the reason people switch from IE to one of the more modern browsers is because they are just that much snappier.


Maybe. Though the only time there's going to be a noticeable slowdown is when you're dealing with huge amounts of text, like an amount that will take twenty minutes or an hour to read. At that point, an extra tenth of a second to render in return for a much more pleasant reading experience seems like a no-brainer trade-off.

On a typical Amazon page the difference is going to be unnoticeable.


You can just run the fast algorithm first, then do the more sophisticated one when the browser is idle. (The text will jump around a bit, but it does so anyway while loading.)


In the end, most of the big browsers are open source, so I guess you could try submitting a patch, but I agree with most of the answers here about performance, laziness, and the fact that users don't care.


This is great news. However, having a Knuth & Plass line breaking algorithm without hyphenation is like sex without a partner: pretty much useless. The gains you get from using a total-fit algorithm are much smaller than what you lose from the lack of hyphenation. If hyphenation is implemented, this will be great!


Well, hyphenation isn't too hard, since there are reasonable solutions that can be used directly or easily ported. (There are existing JavaScript hyphenation libraries floating around.)

The real problem with all of this is that everything is being done in a canvas element, which means no text selection, no searching, etc. etc.


Hyphenation is definitively on my to-do list. As you said, it will improve the output of the canvas tremendously.

The canvas is just a way to show the output of the line breaking algorithm. Earlier I had an example on my website of a dynamically created DOM paragraph using the Knuth & Plass algorithm. That way, the rendering was done by your browser (i.e. selecting, searching, etc. were possible) while the line breaks were handled by the algorithm. Unfortunately I haven't been able to get it to work reliably on all browsers, so I took it out before I posted this item. It's still in my GitHub repository: http://github.com/bramstein/javascript/tree/master/src/types...
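
The glue code for that is conceptually simple; a hypothetical sketch (each line object is assumed to carry its text plus the extra word spacing, in pixels, derived from the line's adjustment ratio):

    // Render lines computed by the algorithm as normal DOM nodes,
    // so selection and searching keep working.
    function renderParagraph(lines, container) {
        var p = document.createElement('p');
        lines.forEach(function (line) {
            var span = document.createElement('span');
            span.style.display = 'block';
            span.style.wordSpacing = line.extraSpacing + 'px';
            span.appendChild(document.createTextNode(line.text));
            p.appendChild(span);
        });
        container.appendChild(p);
    }

The hard part, as noted above, is getting every browser's own text measurement and spacing to agree with the algorithm's.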


I'm definitely following this and waiting for when you get the browser rendering complete. That could be really great.


Now that Firefox 2.0 is history, you can reliably use soft hyphens to hyphenate HTML text, and there are JavaScript libraries to do the hyphenation on the client:

http://www.mnn.ch/hyph/hyphenation1.html
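
The approach is straightforward: run the hyphenation library over each word and join the resulting fragments with U+00AD. A minimal sketch (hyphenate is assumed to be one of those libraries, returning a word's syllable chunks):

    // The browser only renders a soft hyphen when it actually
    // breaks the word at that point.
    function softHyphenate(text, hyphenate) {
        return text.split(' ').map(function (word) {
            return hyphenate(word).join('\u00AD');
        }).join(' ');
    }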


Does the TeX algorithm reduce rivers in the text, or is that just a coincidence?

http://en.wikipedia.org/wiki/River_(typography)

edit: re-reading, this seems to follow naturally from reducing the size of the gaps between words, which the algorithm does.


As you said: the Knuth/Plass algorithm does not reduce or even detect rivers. Detecting rivers can be done in linear time, but avoiding them would very likely put the algorithm in a higher time complexity class (and the normal algorithm is almost linear in input length!).
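
Detection is linear because the inter-word gaps on each line are already sorted left to right, so overlaps between consecutive lines fall out of a two-pointer sweep. A rough JavaScript sketch (each gap is assumed to be a {left, right} interval in page coordinates):

    // Report gaps on two consecutive lines that overlap horizontally,
    // i.e. candidate river segments. Linear in the number of gaps.
    function findRiverPairs(above, below) {
        var pairs = [];
        var a = 0, b = 0;
        while (a < above.length && b < below.length) {
            var g = above[a], h = below[b];
            if (g.left < h.right && h.left < g.right) {
                pairs.push([g, h]);
            }
            // advance whichever gap ends first
            if (g.right < h.right) { a += 1; } else { b += 1; }
        }
        return pairs;
    }

Avoiding the rivers once found is the expensive part, since fixing one line can create new overlaps elsewhere.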


Rivers are killed naturally, except for pathological edge cases, when you (a) do proper hyphenation, (b) do something like the TeX algorithm for paragraph layout, and (c) adjust inter-letter space as well as inter-word space.


Impressive!

Doesn't work on the iPhone, sadly. Not sure why not, because canvas is supported, I think.

Might be possible to make text selectable, but looks like a huge chore: http://stackoverflow.com/questions/1451635/how-to-make-canva...


Older versions of WebKit canvas don't do text, I don't think. Not totally sure.


I use the new Canvas Text API, which, I guess, isn't supported by the iPhone. It would be fairly straightforward to make it use the old text API. I chose canvas as an output format for this demonstration; the algorithm itself knows nothing of canvas, so it could be adapted to use whatever output format is suitable for a device.
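
For reference, the new text API amounts to two calls, which is also all a line breaker needs from canvas (a minimal sketch; the line strings and positions would come from the algorithm):

    var ctx = document.getElementsByTagName('canvas')[0].getContext('2d');
    ctx.font = '14px serif';

    // measureText feeds word widths to the algorithm...
    var width = ctx.measureText('example').width;

    // ...and fillText draws each computed line at its baseline.
    lines.forEach(function (line, i) {
        ctx.fillText(line, 10, 20 + i * 18);
    });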


Do you have any plans on intra-character spacing as well? I know this is done in print to reduce the word gaps in justified text but I'm not sure if the resolution on screen would allow it.

I have a soft-hyphen (&shy;) hyphenator using Liang's algorithm as a YUI3 module, if you're interested.
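
For the curious, the core of Liang's method fits in a page: slide patterns with interleaved digits over the word, keep the highest digit at each inter-letter position, and allow a break wherever the final digit is odd. A toy JavaScript sketch (real implementations load the full TeX pattern files; patterns here is assumed to be such a list):

    function hyphenate(word, patterns) {
        var w = '.' + word.toLowerCase() + '.'; // dots mark word edges
        var points = [];
        for (var p = 0; p <= w.length; p += 1) points[p] = 0;

        patterns.forEach(function (pattern) {
            var letters = pattern.replace(/\d/g, '');
            for (var start = w.indexOf(letters); start !== -1;
                 start = w.indexOf(letters, start + 1)) {
                var pos = start;
                for (var k = 0; k < pattern.length; k += 1) {
                    var c = pattern.charAt(k);
                    if (/\d/.test(c)) {
                        points[pos] = Math.max(points[pos], +c);
                    } else {
                        pos += 1;
                    }
                }
            }
        });

        // odd values mark permissible breaks; keep at least two
        // letters on each side of a break
        var parts = [], last = 1;
        for (var i = 3; i < w.length - 2; i += 1) {
            if (points[i] % 2 === 1) {
                parts.push(word.substring(last - 1, i - 1));
                last = i;
            }
        }
        parts.push(word.substring(last - 1));
        return parts; // e.g. ['hy', 'phen', 'ation'] with suitable patterns
    }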


If you used sub-pixel alignment, it might be worth it.

I just wish we could get 300dpi displays for our computers already, and stop worrying about the pixel grid. But, we're going to need some better techniques for resolution independent UIs before that happens. In particular, you will probably actually need better hinting for vector based UI graphics, so that the same vector graphics that can be used for high resolution displays can also be used for lower resolution displays while still looking good.


It’s worth it to use some inter-letter spacing anyway, even if the spaces can only be whole numbers of pixels, because it avoids exceptionally wide or narrow word spaces.


Not at the moment. Intra-character spacing, or microtypography, is a bit of a controversial subject among some typographers. Currently I'm leaning towards not doing it. Regardless of that, it would also make the implementation much more complicated (or impossible), as it requires access to the kerning tables, which, as far as I know, is not possible from a canvas context.

I would be very much interested in an open-source hyphenator (I'm aware of some other JavaScript implementations of Liang's algorithm) as I was planning to either use an existing one or to implement it myself.


Controversial because you can't do it in metal? Or because, if you overdo it, it starts to look crappy?

I’m not sure too many people especially object to adjustments of 2-3% to inter-letter space or letter widths.


One of the problems with adjusting inter-letter spacing is that it completely breaks certain types of font: think of calligraphic scripts, or non-western writing systems such as Devanagari where the rendered glyphs touch to give the continuous appearance of the text.

Another problem is ligatures. Where adjusting inter-letter spacing is possible, it usually needs to be applied uniformly across at least a whole word to work well. If you are writing the word "difficult", and the "ffi" is rendered using a ligature to avoid awkward clashes, then how do you adjust the spacing for that? You could keep the ligature, but then the inter-letter spacing isn't uniform across the whole word. Alternatively, you could break the ligature, but that could restore the undesirable clashes between the "f", "f" and "i" that the ligature was designed to prevent. To avoid that, you would have to open up the inter-letter spacing so much that the letters no longer clash, which is probably far too obvious in most fonts to be a useful effect.


From what I read, the latter. Like you said, overdoing it quickly makes a text look bad, and it also ruins the (presumably) carefully created kerning tables that are supplied by the font. I assume you have read Hàn Thế Thành's paper on microtypographic extensions to TeX (http://www.pragma-ade.com/pdftex/thesis.pdf)?



