Why and how you ought to keep multibyte character support simple [pdf] (openbsd.org)
98 points by protomyth on Sept 29, 2016 | 35 comments



On the "Caveats for xterm" page it says

  On other operating systems except OpenBSD, there is no way in hell
  to make the interaction of locales with terminal controls truly safe.
But consider a Linux system based on musl libc: its approach is not very different from OpenBSD's policy of UTF-8 and ASCII only; it's probably pretty close, even if not perfect:

http://wiki.musl-libc.org/wiki/Functional_differences_from_g...


It should probably say "On operating systems which support arbitrary locales, ..."


I don't understand how the algorithm on page 21 works. Aren't many Unicode characters formed with multiple code points, like <modifying-mark><basic-character>? If these are reversed to be <basic-character><modifying-mark>, then the textual output would actually be different, wouldn't it?

Shouldn't rev(1) reverse graphemes instead of code points?
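
For instance (a minimal sketch, assuming a perl with UTF-8 output enabled): reversing the decomposed string "no" + U+0302 COMBINING CIRCUMFLEX ACCENT + "n" per code point moves the circumflex onto the wrong base letter:

    perl -CS -le 'print scalar reverse "no\x{0302}n"'

This prints n̂on rather than nôn.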


Yes, rev(1) probably should handle combined characters.

But those are a property of Unicode, not UTF-8. UTF-8 encodes code points, and we often try to get away without decoding them. Of course the resulting Unicode can change its meaning but it's still valid Unicode (and valid UTF-8).

In some cases we already look at Unicode properties (such as a character's column width). So perhaps we can find a nice way to fix this problem in rev(1), some day.

There are many more interesting Unicode issues we don't address in OpenBSD's UTF-8 support (e.g. Han unification, precomposed vs. decomposed normalization).

But we have to start somewhere.

Perhaps, eventually, someone will specify a minimal and sane variant of Unicode which removes all the ambiguities, edge cases, and silly symbols. We'd probably switch over in a heartbeat.


What would a minimal and sane variant of Unicode be like? Removing the weird behaviour of Unicode would necessarily mean removing support for some characters, like those that only exist in decomposed form with combining diacritics, and some types of scripts like right-to-left. Mapping code points, characters and graphemes one-to-one seems like it would make text processing easier at the cost of excluding a large portion of the character set.

I guess it would form a middle ground; US-ASCII is also a minimal subset of Unicode where text processing is easy.


Ding ding! Hard things are hard.

It seems... at least a bit arrogant for a developer who doesn't write any of the languages that rely on these features to claim that they're insane and excessive.


You'd switch, lose the ability to convert between Unicode and all those random 8-bit encodings, and end up having to keep support for encodings that can no longer be converted to Unicode.


Once we have a minimal and sane variant of humans without ambiguities, edge cases and silly symbols we can get right on that.


FWIW the Unicode spec describes combining marks as characters in their own right. So if the intent is to reverse characters, page 21 does the job. The resulting sequences will potentially be defective but not ill-formed.

That being said, an FAQ on combining characters points out that Unicode's definition of "character" may not match an end user's, and that it's best to use the word "grapheme" instead for clarity. (And that being said, if the typical end user knows what "grapheme" means, I'll eat my cat.)

So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.

(Incidentally, the proper sequence is <base character><combining character>…, not the other way around.)
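
A minimal sketch of that practical suggestion, using the core Unicode::Normalize module to put input into NFC before reversing:

    perl -CDS -MUnicode::Normalize -lne 'print scalar reverse NFC($_)'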


> So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.

But there are real-world characters that don't have precomposed forms (IIRC, e.g. Indic scripts).


  > Incidentally, the proper sequence is <base character><combining character>…, not the other way around.
A mistake in Unicode, IMHO. The other way around, it would have been possible to identify the end of a combining sequence without looking past the sequence. Also, ‘dead keys’ could have directly generated the required combining characters just like normal characters, rather than requiring special processing.


It looks like the talk is limited to instructions on how to produce well-formed UTF-8 string transformations. Once you go farther than that, you're into the nitty-gritty of internationalization, which is itself a moving target as cultures change.

For example, how should strrev handle BiDi control characters? When is the Unicode BiDi algorithm appropriate and when is it not? (Hint: formulas look pretty messed up with it.) At some point these become application-specific.


> Shouldn't rev(1) reverse graphemes instead of code points?

I honestly don't know. Is the intended purpose of this program to reverse bytes, or reverse characters, or reverse grapheme clusters? Or extended grapheme clusters?

There's no spec - this has never been in POSIX. What is your expected behavior? Is it mine?

For what it's worth, I needed rev recently, but forgot that it existed and did this:

    perl -ne 'chomp; print scalar reverse . "\n"'
If I need it to handle UTF-8 in a certain way, I can use pragmas to change its behavior. (I'm pretty sure that this, as it is, will ignore the surrounding locale.)


UTF-8 can also be handled in several ways. There is a lot of middle ground between software that handles bytes and typesetting software that is fully Unicode-aware. The small Unix utilities fall somewhere in there.


> UTF-8 can also be handled in several ways.

UTF-8 cannot be handled in several ways without breaking it; it's a pretty straightforward and strict encoding.


What would you expect the output to be when the input is:

    nôn
    nôn
Sending that through rev (with a UTF-8 locale), I get

    n̂on
    nôn
By the way, did you know the perl -l flag removes newlines on input and adds them back on print? So your command could just be:

  perl -lne 'print scalar reverse'
And, for a unicode-aware version:

  perl -CDS -lne 'print scalar reverse'
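
And, as a sketch of a grapheme-aware version (Perl's \X matches an extended grapheme cluster):

  perl -CDS -lne 'print join "", reverse /\X/g'

With that one, both test lines above come back as nôn.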


I would say it should instead reverse extended grapheme clusters. The advent of emoji and emoji ZWJ sequences makes those kinds of "characters" especially common. For example, the family emoji is thought of as a single "character" from the user's perspective, but it actually consists of seven code points.

EDIT: HN ate my emoji, so here's a link to its Emojipedia page: http://emojipedia.org/family-man-woman-girl-boy/
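
For reference, the seven code points are U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL, U+200D ZWJ, U+1F466 BOY; a minimal sketch that prints the sequence:

    perl -CS -le 'print "\x{1F468}\x{200D}\x{1F469}\x{200D}\x{1F467}\x{200D}\x{1F466}"'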


I had no idea that had been accepted. Here's Unicode TR51, with the whole set.[1]

You can now specify skin tone, but cannot yet mix skin tone within a family group. Someone is probably already demanding that feature.

http://www.unicode.org/reports/tr51/tr51-2.html


> The advent of emojis and emoji ZWJ sequences make those kind of "characters" especially common.

They're already common through macOS filesystems: macOS decomposes characters by default (ä into ¨ + a).


> ä into ¨+a

You've reversed the encoding (the combining character follows, so U+00E4 decomposes to U+0061 U+0308), and that's a simpler case: combining characters have very specific properties which mark them as combining and as the continuation of a grapheme. AFAIK that's not the case for ZWJ sequences.
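
(A quick way to check that decomposition, as a sketch using the core Unicode::Normalize module:)

    perl -MUnicode::Normalize -le 'printf "U+%04X\n", ord for split //, NFD("\x{E4}")'

which prints U+0061 followed by U+0308.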

There's also a case to be made for reverse(WOMAN ZWJ HEAVY BLACK HEART ZWJ MAN) to be MAN ZWJ HEAVY BLACK HEART ZWJ WOMAN (sadly, the latter does not seem to be handled correctly, at least by Apple's rendering system; it fails to combine into a single "couple" glyph).


> then the textual output would actually be different, wouldn't it?

Yes. The provided algorithm doesn't break UTF-8, but it will break the encoded text, as it's not Unicode-aware. I'm pretty sure it'll also let through invalid UTF-8 (lone surrogates or non-shortest encodings).
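
For the curious, a rough byte-level sketch of that kind of reversal in Perl, reversing code points without ever decoding them; note that malformed sequences pass straight through:

    perl -lne 'print join "", reverse /[\xC0-\xFF][\x80-\xBF]*|./sg'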


My takeaway is that POSIX is completely broken and needs to be re-evaluated.


It'll fit right in then.


big news


Why the random photos on each slide?


The photos are all from the area around Calgary, where some of the initial ideas were born during an OpenBSD hackathon. IIRC we disabled Latin1 support during this hackathon.

While giving this talk in Belgrade, Ingo apologized he didn't have photos from a Belgrade hike yet so he used the Calgary ones instead.


This does not answer the original question. What is the purpose of these photos?


It's a little thing called decoration, you should look it up sometime...


I like photos as decoration, but these all have captions, in the same font as the presentation. The captions contain quite a bit of information (like the heights of the mountains). The photos are also placed where you'd expect presentation images to be. But it's all entirely unrelated to the presentation.

All in all it's much more distracting than normal decoration.


To answer in your tone: You misspelled distraction. ;)

...but seriously, it does not add any value to the presentation. Also, every photo has a caption, which can be a real distraction. I can imagine someone trying to read all of them and losing track of the presentation on every single slide. The author could have decorated with something on-topic if he felt the slides were too plain.


>To answer in your tone: You misspelled distraction. ;)

Heh, less of a malevolent tone, and more of a reference to a Futurama episode (s2e6).


To capture imagery of Calgary.


These are slides that a developer put together to show off to other developers, i.e. friends. He included a photo journey to go along with all the boring technical stuff; what's your problem?


No problem, just curious.


Honest question: why did they keep the C locale instead of (like Plan 9) going all-out UTF-8?



