Line Breaking (2014) (xxyxyz.org)
91 points by netgusto on July 18, 2018 | 16 comments



My favorite discussion of this is in TeX: The Program, starting at §813 on page 302: http://brokestream.com/tex.pdf (I'm not sure if the discussion can be followed by jumping in in the middle; I read the book cover to cover at one point).


While theoretically very interesting, this seems to be pretty much a non-problem given today's computing power, combined with the fact that paragraphs are usually not very long, and the fact that imperfections are not actually disastrous.


...in the same way that bubblesort is a "non-problem" with small sizes?

The problem is that in the real world, especially with untrusted arbitrary input, an attacker can easily force these algorithms into their worst-case running time. Combine that with the typical (in)efficiency of HLLs and there's potential for DoS. See the exponential backtracking behaviour of some regex engines for a related example.


It will always be important to sanitize and validate your input.


Sure, but that doesn't mean we need to introduce more classes of problems we have to check against. Designing algorithms to handle all cases efficiently is often better than trying to filter out everything that triggers cases the lazy algorithms can't handle.


Manuals and humanities texts may well have paragraphs of more than 500 words. At n = 500, an O(n^2) algorithm needs on the order of 250,000 steps, so it may take quite a while.


250,000 is nothing for a modern computer with often more than 1M pixels on its screen.


But that is just ONE paragraph under ONE circumstance.


Add hyphenation, protrusion, and character expansion to the mix and I'm interested in seeing how these can be handled.


When adding hyphenation, one should ideally start thinking about internationalization, because different languages have different hyphenation rules. In Norwegian, for example, one should generally hyphenate between the stems of a compound word (for example, our word for fruit salad is "fruktsalat"; that should be hyphenated as "frukt-salat", not as "fruktsa-lat").

It's really jarring to read Norwegian text hyphenated by an algorithm which uses American rules for hyphenation.
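
To make the stem rule concrete, here is a minimal sketch in Python. It assumes a small, hypothetical set of known stems (`stems`) and only offers break points at stem boundaries; `split_compound` and `hyphenate` are illustrative names, not part of any real hyphenation library. A real hyphenator would need a proper lexicon plus rules for breaks inside a stem.

```python
from typing import List, Optional, Set


def split_compound(word: str, stems: Set[str]) -> Optional[List[str]]:
    """Split `word` into known stems, or return None if it cannot be fully segmented."""
    n = len(word)
    # cut[i] = start index of the last stem in a segmentation of word[:i], or -1 if none exists.
    cut = [-1] * (n + 1)
    cut[0] = 0  # the empty prefix is trivially segmentable
    for end in range(1, n + 1):
        for start in range(end):
            if cut[start] != -1 and word[start:end] in stems:
                cut[end] = start
                break
    if cut[n] == -1:
        return None
    # Walk the recorded cuts backwards to recover the stems.
    parts: List[str] = []
    end = n
    while end > 0:
        parts.append(word[cut[end]:end])
        end = cut[end]
    return parts[::-1]


def hyphenate(word: str, stems: Set[str]) -> str:
    """Mark stem boundaries with '-'; leave unknown words unbroken."""
    parts = split_compound(word, stems)
    return "-".join(parts) if parts else word


# Hypothetical two-entry lexicon, just enough for the example from the comment.
print(hyphenate("fruktsalat", {"frukt", "salat"}))
```

With that toy lexicon the only break offered is between "frukt" and "salat", so "fruktsa-lat" can never be produced.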


The problem might be made notably more difficult when accounting for non-monospaced fonts and rivers: https://en.m.wikipedia.org/wiki/River_(typography)


Penalizing rivers would complicate things, but non-monospaced fonts can be handled with only a small modification.
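
A rough sketch of why the font change is small, assuming the usual O(n^2) dynamic program and a cost equal to the sum of squared trailing space on every line but the last (a common choice, not necessarily the one used in the linked article). The only font-dependent parts are the `measure` callback and the `space` width, both of which are assumptions of this sketch rather than any real API.

```python
from typing import Callable, List, Sequence


def break_lines(words: Sequence[str], max_width: float,
                measure: Callable[[str], float], space: float) -> List[List[str]]:
    """Choose line breaks minimizing squared slack; the last line costs nothing."""
    n = len(words)
    widths = [measure(w) for w in words]
    # best[i] = minimal cost of laying out words[i:]; nxt[i] = index where the next line starts.
    best = [float("inf")] * (n + 1)
    nxt = [n] * (n + 1)
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        line = -space  # cancels the space added before the first word of the line
        for j in range(i, n):
            line += space + widths[j]
            if line > max_width and j > i:
                break  # too wide; a single overlong word still gets a line of its own
            slack = max(max_width - line, 0.0)
            cost = 0.0 if j == n - 1 else slack * slack
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    # Follow the recorded break points from the start of the paragraph.
    lines: List[List[str]] = []
    i = 0
    while i < n:
        lines.append(list(words[i:nxt[i]]))
        i = nxt[i]
    return lines


words = "aaa bb cc ddddd".split()
# Monospaced: every character is one unit wide.
print(break_lines(words, 6, measure=len, space=1))
# Proportional stand-in: a made-up metric where every character is half a unit wide.
print(break_lines(words, 6, measure=lambda w: 0.5 * len(w), space=0.5))
```

Swapping `len` for a function that sums real glyph advances is the whole change; the recurrence itself never looks at characters, only at widths.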


Rivers are especially hard to detect because, to model what people see, the program would have to look at the actual shapes of the letters. Two similar configurations of character boxes may or may not give the appearance of a river depending on which characters are in those boxes. E.g. if a diagonal river passes along the long stem of "y", the shape of the letter contributes to the illusion, but if the river goes along the short stem, then the longer stem will visually work against it.


I like the graphs and the design!


Thank you, Juraj, for this work, and also for the very promising flat and even projects!


It is almost a non-problem for Chinese text: characters are monospaced squares.



