Line Breaking (2014)

microtherion · on July 19, 2018

My favorite discussion of this is in TeX: The Program, starting at §813 on page 302: http://brokestream.com/tex.pdf (I'm not sure whether the discussion can be followed by jumping in at the middle; I read the book cover to cover at one point).

amelius · on July 19, 2018

While theoretically very interesting, this seems to be pretty much a non-problem given today's computing power, combined with the fact that paragraphs are usually not very long, and the fact that imperfections are not actually disastrous.

userbinator · on July 19, 2018

...in the same way that bubblesort is a "non-problem" with small sizes?

The problem is that in the real world, especially with untrusted arbitrary input, you can easily cause these algorithms to take maximum time. Combine that with the typical (in)efficiency of HLLs and it's a potential for DoS. See the exponential backtracking behaviour of some regex engines for a related example.

idiocyreigns · on July 19, 2018

It will always be important to sanitize and validate your input.

wongarsu · on July 19, 2018

Sure, but that doesn't mean we need to introduce more classes of problems we have to check against. Designing algorithms to handle all cases efficiently is often better than trying to filter everything that triggers cases the lazy algorithms can't handle.

pfortuny · on July 19, 2018

Manuals and humanities texts may well have paragraphs of more than 500 words. So, at 250,000 (n^2) steps, it may take quite a while.

amelius · on July 19, 2018

250,000 is nothing for a modern computer with often more than 1M pixels on its screen.

pfortuny · on July 21, 2018

But that is just ONE paragraph under ONE circumstance.
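
To put a number on the n^2 estimate in this subthread, here is a rough sketch that counts how many candidate lines (start word, end word) a dynamic-programming line breaker would have to look at. The function name `candidate_lines`, the uniform 5-letter words, and the 80-column limit are assumptions chosen for illustration, not anything taken from the article.

```python
def candidate_lines(word_lengths, max_width=None):
    """Count the (start, end) word ranges a DP line breaker would examine.

    Assumes monospaced text with a single space between words.
    With max_width=None every range is counted (the worst case).
    """
    n = len(word_lengths)
    count = 0
    for start in range(n):
        width = -1  # cancels the space counted before the first word
        for end in range(start, n):
            width += 1 + word_lengths[end]
            if max_width is not None and width > max_width:
                break  # longer lines starting here cannot fit either
            count += 1
    return count


paragraph = [5] * 500  # 500 words of 5 letters each (illustrative)
print(candidate_lines(paragraph))      # 125250
print(candidate_lines(paragraph, 80))  # 6422
```

Under these assumptions the unbounded worst case is n(n+1)/2, about half of n^2, and stopping as soon as a line overflows cuts the count to roughly n times the number of words that fit on one line. That is per paragraph and per layout pass, so the total still scales with how often the text is re-broken.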

kccqzy · on July 18, 2018

Add hyphenation, protrusion, and character expansion to the mix and I'm interested in seeing how these can be handled.

mort96 · on July 19, 2018

When adding hyphenation, one should ideally start thinking about internationalization, because different languages have different hyphenation rules. In Norwegian, for example, one should generally hyphenate between stems (for example, our word for fruit salad is "fruktsalat"; that should be hyphenated as "frukt-salat", not as "fruktsa-lat").

It's really jarring to read Norwegian text hyphenated by an algorithm which uses American rules for hyphenation.
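
As a minimal sketch of the stem rule described above: given a list of known stems, a compound can be split at stem boundaries with a small dynamic program. The function name `stem_breaks` and the two-word `STEMS` set are hypothetical; a real hyphenator would need a proper dictionary, and this sketch ignores linking letters between stems and break points inside a stem.

```python
def stem_breaks(word, stems):
    """Split word into known stems, or return None if no split exists."""
    # parts[i] holds one way to split word[:i] into stems, if there is one
    parts = [None] * (len(word) + 1)
    parts[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            if parts[start] is not None and word[start:end] in stems:
                parts[end] = parts[start] + [word[start:end]]
                break
    return parts[len(word)]


STEMS = {"frukt", "salat"}  # hypothetical toy dictionary

print("-".join(stem_breaks("fruktsalat", STEMS)))  # frukt-salat
```

A line breaker would then treat each stem boundary as an allowed (penalized) break point, rather than the positions an English-pattern hyphenator would propose.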

bradbeattie · on July 19, 2018

The problem might be made notably more difficult when accounting for non-monospaced fonts and rivers: https://en.m.wikipedia.org/wiki/River_(typography)

wnoise · on July 19, 2018

Penalizing rivers would complicate things, but non-monospaced fonts are handled with only small modification.
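
To illustrate how small that modification can be, here is a sketch of a least-squares ("minimum raggedness") line breaker in which the only font-dependent parts are a `measure` callable and a `space_width`. This is a simplified stand-in, not TeX's algorithm and not necessarily the article's; the glyph widths in `measure` are made up, and every word is assumed to fit on a line.

```python
# Hypothetical glyph widths in arbitrary units; a real program would
# ask the font for its advance widths instead.
NARROW, WIDE = set("ijl.,'"), set("mwMW")


def measure(word):
    return sum(1 if c in NARROW else 3 if c in WIDE else 2 for c in word)


def break_lines(words, max_width, measure=len, space_width=1):
    """Least-squares line breaking; the last line is not penalized."""
    n = len(words)
    best = [float("inf")] * n + [0]  # best[i]: minimal cost of words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: index where the next line starts
    for i in range(n - 1, -1, -1):
        width = -space_width         # cancels the space before the first word
        for j in range(i, n):
            width += space_width + measure(words[j])
            if width > max_width:
                break
            cost = 0 if j == n - 1 else (max_width - width) ** 2
            if cost + best[j + 1] < best[i]:
                best[i], nxt[i] = cost + best[j + 1], j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


# Monospaced: every character counts as one unit.
print(break_lines("aaa bb cc ddddd".split(), 6))
# ['aaa', 'bb cc', 'ddddd']

# Proportional: same code, different width function.
print(break_lines(["mmm", "ill", "ill", "ill"], 13, measure))
# ['mmm ill', 'ill ill']
```

The dynamic program itself never looks at characters, only at widths, which is why swapping `len` for a font-aware `measure` is enough. Penalizing rivers is different because the cost of a line would then depend on where the spaces fell on the lines above it.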

Mikhail_Edoshin · on July 19, 2018

Rivers are especially hard to detect because to model what people see the program would have to look at actual shapes of the letters. Two similar configurations of character boxes may or may not give the appearance of a river depending on which characters are in these boxes. E.g. if a diagonal river passes along the long stem of "y", the shape of the letter contributes to the illusion, but if the river goes along the short stem, then the longer stem will visually work against it.
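
For contrast, here is a sketch of the naive box-only detector that this comment argues is not enough: it treats every character as an identical box in monospaced text and looks for chains of spaces that line up, within one column, on consecutive lines. The function name `longest_river` and the one-column tolerance are assumptions for illustration; because it ignores letter shapes, it cannot tell the two "y" cases above apart.

```python
def longest_river(lines):
    """Length of the longest chain of spaces on consecutive lines, where
    each space is within one column of the space on the line above."""
    prev = {}  # column -> chain length ending at that column on the previous line
    longest = 0
    for line in lines:
        cur = {}
        for col, ch in enumerate(line):
            if ch == " ":
                above = max(prev.get(col + d, 0) for d in (-1, 0, 1))
                cur[col] = above + 1
                longest = max(longest, cur[col])
        prev = cur
    return longest


text = [
    "aaa bbb ccc",
    "aa bbbb ccc",
    "a bbbbb ccc",
]
print(longest_river(text))  # 3
```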

zypeh · on July 19, 2018

I like the graphs and the design!

m8rl · on July 19, 2018

Thank you, Juraj, for this work and for the also very promising flat and even projects!

abakus · on July 19, 2018

It is almost a non-problem for Chinese text: characters are monospaced squares.