Why and how you ought to keep multibyte character support simple [pdf] (openbsd.org)
98 points by protomyth on Sept 29, 2016 | 35 comments



On the "Caveats for xterm" page it says

  On other operating systems except OpenBSD, there is no way in hell
  to make the interaction of locales with terminal controls truly safe.
But consider a Linux system based on musl libc: its approach is not very different from OpenBSD's policy of UTF-8 and ASCII only; it's probably pretty close, even if not perfect:

http://wiki.musl-libc.org/wiki/Functional_differences_from_g...


It should probably say "On operating systems which support arbitrary locales, ..."


I don't understand how the algorithm on page 21 works. Aren't many Unicode characters formed with multiple code points, like <modifying-mark><basic-character>? If these are reversed to be <basic-character><modifying-mark>, then the textual output would actually be different, wouldn't it?

Shouldn't rev(1) reverse graphemes instead of code points?
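
For instance (a minimal sketch, assuming a perl with UTF-8 output enabled): reversing the decomposed string "no" + U+0302 COMBINING CIRCUMFLEX ACCENT + "n" per code point moves the circumflex onto the wrong base letter:

    perl -CS -le 'print scalar reverse "no\x{0302}n"'

This prints n̂on rather than nôn.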


Yes, rev(1) probably should handle combined characters.

But those are a property of Unicode, not UTF-8. UTF-8 encodes code points, and we often try to get away without decoding them. Of course the resulting Unicode can change its meaning but it's still valid Unicode (and valid UTF-8).

In some cases we already look at Unicode properties (such as a character's column width). So perhaps we can find a nice way to fix this problem in rev(1), some day.

There are many more interesting Unicode issues we don't address in OpenBSD's UTF-8 support (e.g. Han unification, precomposed vs. decomposed normalization).

But we have to start somewhere.

Perhaps, eventually, someone will specify a minimal and sane variant of Unicode which removes all the ambiguities, edge cases, and silly symbols. We'd probably switch over in a heartbeat.


What would a minimal and sane variant of Unicode be like? Removing the weird behaviour of Unicode would necessarily mean removing support for some characters, like those that only exist in decomposed form with combining diacritics, and some types of scripts like right-to-left. Mapping code points, characters and graphemes one-to-one seems like it would make text processing easier at the cost of excluding a large portion of the character set.

I guess it would form a middle ground; US-ASCII is also a minimal subset of Unicode where text processing is easy.


Ding ding! Hard things are hard.

It seems... at least a bit arrogant for a developer who doesn't write any of the languages that rely on these features to claim that they're insane and excessive.


You'd switch, lose the ability to convert between Unicode and all those random 8-bit encodings, and end up having to keep support for encodings that can no longer be converted to Unicode.


Once we have a minimal and sane variant of humans without ambiguities, edge cases and silly symbols we can get right on that.


FWIW the Unicode spec describes combining marks as characters in their own right. So if the intent is to reverse characters, page 21 does the job. The resulting sequences will potentially be defective but not ill-formed.

That being said, an FAQ on combining characters points out that Unicode's definition of "character" may not match an end user's, and that it's best to use the word "grapheme" instead for clarity. (And that being said, if the typical end user knows what "grapheme" means, I'll eat my cat.)

So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.

(Incidentally, the proper sequence is <base character><combining character>…, not the other way around.)
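
A minimal sketch of that practical suggestion, using the core Unicode::Normalize module to put input into NFC before reversing:

    perl -CDS -MUnicode::Normalize -lne 'print scalar reverse NFC($_)'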


> So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.

But there are real-world characters that don't have precomposed forms (IIRC, e.g. Indic scripts).


  > Incidentally, the proper sequence is <base character><combining character>…, not the other way around.
A mistake in Unicode, IMHO. The other way around, it would have been possible to identify the end of a combining sequence without looking past the sequence. Also, ‘dead keys’ could have directly generated the required combining characters just like normal characters, rather than requiring special processing.


It looks like the talk is limited to instructions on how to produce well-formed UTF-8 string transformations. Once you go farther than that, you're into the nitty-gritty of internationalization, which is itself a moving target as cultures change.

For example, how should strrev handle BiDi control characters? When is the Unicode BiDi algorithm appropriate and when is it not? (Hint: formulas look pretty messed up with it.) At some point these become application-specific.


> Shouldn't rev(1) reverse graphemes instead of code points?

I honestly don't know. Is the intended purpose of this program to reverse bytes, or reverse characters, or reverse grapheme clusters? Or extended grapheme clusters?

There's no spec - this has never been in POSIX. What is your expected behavior? Is it mine?

For what it's worth, I needed rev recently, but forgot that it existed and did this:

    perl -ne 'chomp; print scalar reverse . "\n"'
If I need it to handle UTF-8 in a certain way, I can use pragmas to change its behavior. (I'm pretty sure that this, as it is, will ignore the surrounding locale.)


UTF-8 can also be handled in several ways. There is a lot of middle ground between software that handles bytes and typesetting software that is fully Unicode-aware. The small Unix utilities fall somewhere in there.


> UTF-8 can also be handled in several ways.

UTF-8 cannot be handled in several ways without breaking it; it's a pretty straightforward and strict encoding.


What would you expect the output to be when the input is:

    nôn
    nôn
Sending that through rev (with a UTF-8 locale), I get

    n̂on
    nôn
By the way, did you know the perl -l flag removes newlines on input and adds them back on print? So your command could just be:

  perl -lne 'print scalar reverse'
And, for a unicode-aware version:

  perl -CDS -lne 'print scalar reverse'
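
And, as a sketch of a grapheme-aware version (Perl's \X matches an extended grapheme cluster):

  perl -CDS -lne 'print join "", reverse /\X/g'

With that one, both test lines above come back as nôn.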


I would say it should instead reverse extended grapheme clusters. The advent of emoji and emoji ZWJ sequences makes those kinds of "characters" especially common. For example, the family emoji is thought of as a single "character" from the user's perspective, but it actually consists of seven code points.

EDIT: HN ate my emoji, so here's a link to its Emojipedia page: http://emojipedia.org/family-man-woman-girl-boy/
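
For reference, the seven code points are U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL, U+200D ZWJ, U+1F466 BOY; a minimal sketch that prints the sequence:

    perl -CS -le 'print "\x{1F468}\x{200D}\x{1F469}\x{200D}\x{1F467}\x{200D}\x{1F466}"'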


I had no idea that had been accepted. Here's Unicode TR51, with the whole set.[1]

You can now specify skin tone, but cannot yet mix skin tone within a family group. Someone is probably already demanding that feature.

http://www.unicode.org/reports/tr51/tr51-2.html


> The advent of emojis and emoji ZWJ sequences make those kind of "characters" especially common.

They're already common through macOS filesystems: macOS decomposes characters by default (ä into ¨ + a).


> ä into ¨+a

You've reversed the encoding (the combining character follows, so U+00E4 decomposes to U+0061 U+0308), and that's a simpler case: combining characters have very specific properties which mark them as combining and as the continuation of a grapheme. AFAIK that's not the case for ZWJ sequences.
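
(A quick way to check that decomposition, as a sketch using the core Unicode::Normalize module:)

    perl -MUnicode::Normalize -le 'printf "U+%04X\n", ord for split //, NFD("\x{E4}")'

which prints U+0061 followed by U+0308.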

There's also a case to be made for reverse(WOMAN ZWJ HEAVY BLACK HEART ZWJ MAN) to be MAN ZWJ HEAVY BLACK HEART ZWJ WOMAN (sadly, the latter does not seem to be handled correctly, at least by Apple's rendering system; it fails to combine into a single "couple" glyph).


> then the textual output would actually be different, wouldn't it?

Yes. The provided algorithm doesn't break UTF-8, but it will break the encoded text, as it's not Unicode-aware. I'm pretty sure it'll also let through invalid UTF-8 (lone surrogates or non-shortest encodings).
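
For the curious, a rough byte-level sketch of that kind of reversal in Perl, reversing code points without ever decoding them; note that malformed sequences pass straight through:

    perl -lne 'print join "", reverse /[\xC0-\xFF][\x80-\xBF]*|./sg'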


My takeaway is that POSIX is completely broken and needs to be re-evaluated.


It'll fit right in then.


big news


Why the random photos on each slide?


The photos are all from the area around Calgary, where some of the initial ideas were born during an OpenBSD hackathon. IIRC we disabled Latin1 support during this hackathon.

While giving this talk in Belgrade, Ingo apologized he didn't have photos from a Belgrade hike yet so he used the Calgary ones instead.


This does not answer the original question. What is the purpose of these photos?


It's a little thing called decoration, you should look it up sometime...


I like photos as decoration, but these all have captions, in the same font as the presentation. The captions contain quite a bit of information (like the heights of the mountains). The photos are also placed where you'd expect presentation images to be. But it's all entirely unrelated to the presentation.

All in all it's much more distracting than normal decoration.


To answer in your tone: You misspelled distraction. ;)

...but seriously, it does not add any value to the presentation. Also, every photo has a caption, which can be a real distraction. I can imagine someone trying to read all of them and losing track of the presentation on every single slide. The author could have decorated with something on-topic if he felt the slides were too plain.


>To answer in your tone: You misspelled distraction. ;)

Heh, less of a malevolent tone, and more of a reference to a Futurama episode (s2e6).


To capture imagery of Calgary.


These are slides that a developer put together to show off to other developers, i.e. friends. He included a photo journey to go along with all the boring technical stuff; what's your problem?


No problem, just curious.


Honest question: why did they keep the C locale instead of (like Plan 9) going all-out UTF-8?



