On operating systems other than OpenBSD, there is no way in hell to make the interaction of locales with terminal controls truly safe.
But consider a Linux system based on musl libc: its policy is not very different from OpenBSD's UTF-8-and-ASCII-only approach. It's probably pretty close, even if not perfect:
I don't understand how the algorithm on page 21 works. Aren't many Unicode characters formed with multiple code points, like <modifying-mark><basic-character>? If these are reversed to be <basic-character><modifying-mark>, then the textual output would actually be different, wouldn't it?
Shouldn't rev(1) reverse graphemes instead of code points?
Yes, rev(1) probably should handle combined characters.
But those are a property of Unicode, not UTF-8. UTF-8 encodes code points, and we often try to get away without decoding them. Of course the resulting Unicode can change its meaning but it's still valid Unicode (and valid UTF-8).
In some cases we already look at Unicode properties (such as a character's column width). So perhaps we can find a nice way to fix this problem in rev(1), some day.
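The problem the question raises can be sketched in Python (an illustration I'm adding here, not code from the talk), since Python strings are sequences of code points and slicing reverses them one by one:

```python
# Reversing by code point detaches a combining mark from its base.
# "a" + COMBINING DIAERESIS renders as one character; after reversal
# the mark lands on "b" instead, so the visible text changes.
s = "a\u0308bc"    # a + COMBINING DIAERESIS, b, c
rev = s[::-1]      # c, b, COMBINING DIAERESIS, a
assert rev == "cb\u0308a"
```

The result is still valid Unicode and, once encoded, valid UTF-8; it just no longer means the same thing.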
There are many more interesting Unicode issues we don't address in OpenBSD's UTF-8 support (e.g. Han unification, precomposed vs. decomposed normalization).
But we have to start somewhere.
Perhaps, eventually, someone will specify a minimal and sane variant of Unicode, which removes all the ambiguities, edge cases, and silly symbols. We'd probably switch over in a heartbeat.
What would a minimal and sane variant of Unicode be like? Removing the weird behaviour of Unicode would necessarily mean removing support for some characters, like those that only exist in decomposed form with combining diacritics, and some types of scripts like right-to-left. Mapping code points, characters and graphemes one-to-one seems like it would make text processing easier at the cost of excluding a large portion of the character set.
I guess it would form a middle ground; US-ASCII is also a minimal subset of Unicode where text processing is easy.
It seems... at least a bit arrogant for a developer that doesn't write any of the languages that rely on these features to claim that they're insane and excessive.
You'd switch, lose the ability to convert between Unicode and all those random 8-bit encodings, and end up still having to support encodings you could no longer convert to Unicode at all.
FWIW the Unicode spec describes combining marks as characters in their own right. So if the intent is to reverse characters, page 21 does the job. The resulting sequences will potentially be defective but not ill-formed.
That being said, an FAQ on combining characters points out that Unicode's definition of "character" may not match an end user's, and that it's best to use the word "grapheme" instead for clarity. (And that being said, if the typical end user knows what "grapheme" means, I'll eat my cat.)
So from a practical standpoint, it's best to make sure that any input to rev is in one of the composed normal forms.
(Incidentally, the proper sequence is <base character><combining character>…, not the other way around.)
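The "normalize first" suggestion can be sketched like this (illustrative Python, not anything from the talk); note that NFC has limits, since some base/mark combinations have no precomposed form:

```python
import unicodedata

s = "a\u0308"                          # decomposed: base + combining mark
nfc = unicodedata.normalize("NFC", s)  # composed: a single code point, U+00E4
assert nfc == "\u00e4"
assert nfc[::-1] == nfc                # code-point reversal is now harmless here
```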
> Incidentally, the proper sequence is <base character><combining character>…, not the other way around.
A mistake in Unicode, IMHO. The other way around, it would have been possible to identify the end of a combining sequence without looking past the sequence. Also, ‘dead keys’ could have directly generated the required combining characters just like normal characters, rather than requiring special processing.
It looks like the talk limits itself to instructions for producing well-formed UTF-8 string transformations. Once you go further than that, you're into the nitty-gritty of internationalization, which is itself a moving target as cultures change.
For example, how should strrev handle BiDi control characters? When is the Unicode BiDi algorithm appropriate and when is it not? (Hint: formulas look pretty messed up with it.) At some point these questions become application-specific.
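One way to see the BiDi problem (a Python sketch I'm adding; the point is where the control characters end up, not how any particular renderer displays them): naive code-point reversal leaves directional controls in positions where they no longer govern the text they were meant to.

```python
RLO = "\u202e"   # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"   # POP DIRECTIONAL FORMATTING

s = RLO + "abc" + PDF   # the override applies to "abc"
rev = s[::-1]           # the controls swap ends: the override now trails the text
assert rev == PDF + "cba" + RLO
```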
> Shouldn't rev(1) reverse graphemes instead of code points?
I honestly don't know. Is the intended purpose of this program to reverse bytes, or reverse characters, or reverse grapheme clusters? Or extended grapheme clusters?
There's no spec - this has never been in POSIX. What is your expected behavior? Is it mine?
For what it's worth, I needed rev recently, but forgot that it existed and did this:
perl -ne 'chomp; print scalar reverse . "\n"'
If I need it to handle UTF-8 in a certain way, I can use pragmas to change its behavior. (I'm pretty sure that this, as it is, will ignore the surrounding locale.)
UTF-8 can also be handled in several ways. There is a lot of middle ground between software that handles bytes, and typesetting software which is fully unicode-aware.
The small unix utilities fall somewhere in there.
I would say it should instead reverse extended grapheme clusters. The advent of emoji and emoji ZWJ sequences makes those kinds of "characters" especially common. For example, the family emoji is thought of as a single "character" from the user's perspective, but it actually consists of seven code points.
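A rough sketch of the cluster-based approach (hypothetical Python, using only `unicodedata.combining`): keep each base code point together with the combining marks that follow it. This legacy-style approximation deliberately ignores ZWJ sequences, regional indicators, and the other UAX #29 rules a real extended-grapheme-cluster segmenter must handle.

```python
import unicodedata

def rev_graphemes_approx(s: str) -> str:
    """Reverse an approximation of grapheme clusters: a base code point
    plus any combining marks that follow it stays together. Not a full
    UAX #29 segmenter -- ZWJ sequences etc. are not handled."""
    clusters, cur = [], ""
    for ch in s:
        if cur and unicodedata.combining(ch):
            cur += ch              # combining mark: extend the current cluster
        else:
            if cur:
                clusters.append(cur)
            cur = ch               # anything else starts a new cluster
    if cur:
        clusters.append(cur)
    return "".join(reversed(clusters))

# The decomposed umlaut stays attached to its base after reversal.
assert rev_graphemes_approx("a\u0308bc") == "cba\u0308"
```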
You've reversed the encoding (the combining character follows, so U+00E4 decomposes to U+0061 U+0308), and that's a simpler case as combining characters are combining, they have very specific properties which mark them as combining characters and as continuation of a grapheme. AFAIK that's not the case for ZWJ sequences.
There's also a case to be made for reverse(WOMAN ZWJ HEAVY BLACK HEART ZWJ MAN) to be MAN ZWJ HEAVY BLACK HEART ZWJ WOMAN. (Sadly the latter does not seem to be handled correctly, at least by Apple's rendering system; it fails to combine into a single "couple" glyph.)
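Amusingly, for this particular sequence plain code-point reversal already produces the swapped couple, because each element is a single code point and the structure is palindromic (a Python illustration; adding variation selectors such as U+FE0F after the heart would break this symmetry):

```python
WOMAN, MAN = "\U0001f469", "\U0001f468"
HEART, ZWJ = "\u2764", "\u200d"    # HEAVY BLACK HEART, ZERO WIDTH JOINER

couple = WOMAN + ZWJ + HEART + ZWJ + MAN
# Code-point reversal yields the mirrored ZWJ sequence.
assert couple[::-1] == MAN + ZWJ + HEART + ZWJ + WOMAN
```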
> then the textual output would actually be different, wouldn't it?
Yes. The provided algorithm doesn't break UTF-8, but it will break the encoded text, since it's not Unicode-aware. I'm pretty sure it'll also let through invalid UTF-8 (lone surrogates or non-shortest encodings).
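Examples of the ill-formed input in question (a Python illustration of the byte sequences, not code from the talk): a strict UTF-8 decoder rejects both a non-shortest ("overlong") encoding and an encoded surrogate, whereas a byte-level reverser would pass them through untouched.

```python
overlong = b"\xc0\xaf"        # non-shortest encoding of "/" -- ill-formed UTF-8
surrogate = b"\xed\xa0\x80"   # encoded lone surrogate U+D800 -- also ill-formed

for bad in (overlong, surrogate):
    try:
        bad.decode("utf-8")
        raise AssertionError("strict decoder accepted ill-formed input")
    except UnicodeDecodeError:
        pass                  # expected: strict decoding rejects both sequences
```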
The photos are all from the area around Calgary, where some of the initial ideas were born during an OpenBSD hackathon. IIRC we disabled Latin1 support during this hackathon.
While giving this talk in Belgrade, Ingo apologized he didn't have photos from a Belgrade hike yet so he used the Calgary ones instead.
I like photos as decoration, but these all have captions, in the same font as the presentation. The captions carry quite a bit of information (like the heights of the mountains), and the photos sit where you'd expect presentation images to be. But it's all entirely unrelated to the presentation.
All in all it's much more distracting than normal decoration.
To answer in your tone: You misspelled distraction. ;)
...but seriously, it does not add any value to the presentation. Also, every photo has a caption, which can truly be a distraction: I can imagine someone trying to read all of them and losing track of the presentation on every single slide. The author could have decorated with something on-topic if he felt the slides were too plain.
These are slides that a developer put together to show off to other developers, i.e: friends. He included a photo journey to go along with all the boring technical stuff, what's your problem?
http://wiki.musl-libc.org/wiki/Functional_differences_from_g...