The rationale given for including mirrored half-stars as separate codepoints is right-to-left languages. I wondered why this was needed, since Unicode already has a right-to-left mark (RLM).[1]
I found the answer in a comment on "Explain XKCD".[2] The RLM usually only reorders characters, but does not mirror their glyphs. The exceptions are glyphs with the "Bidi_Mirrored=Yes" property, which are mapped to a mirrored codepoint.[3]
The half-stars proposal includes a note on that property: "Existing stars are in the “Other Neutrals” class, so half stars should probably use the ON bidirectional class. The half stars have the obvious mirrored counterparts, so they can be Bidi mirrored. However, similar characters such as LEFT HALF BLACK CIRCLE are not marked as mirrored. I'll leave it up to the Unicode experts to determine if Bidi Mirrored would be appropriate or not."
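These properties are easy to inspect from Python's unicodedata module; a quick sketch (the codepoints shown are just illustrative examples):

    import unicodedata

    # Characters with Bidi_Mirrored=Yes report mirrored() == 1; the bidi
    # algorithm swaps in their mirrored glyph inside right-to-left runs.
    print(unicodedata.mirrored("("))   # 1 -- parentheses are mirrored
    print(unicodedata.mirrored("A"))   # 0 -- letters are not

    # U+2605 BLACK STAR sits in the "Other Neutral" bidi class and,
    # like the other existing stars, is not mirrored:
    print(unicodedata.bidirectional("\u2605"))  # 'ON'
    print(unicodedata.mirrored("\u2605"))       # 0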
The one I'm surprised about is not the stars, but actually the bitcoin character. It's just a form of branding to me, and while I think there are interesting uses for blockchain technology, public interest seems to be a bit inflated. Plus, blockchain tech will likely outlive bitcoin itself.
It's not like there is some central Bitcoin company, so what is the brand? Brands are generally owned by companies and are intellectual property in the eyes of governments.
Unicode doesn't just contain things that are in use now and will be in use forever. It contains characters that were in use in one computer system once, characters for dead languages, abstract symbols that the next generation will barely understand, and more.
The Bitcoin symbol is used in textual documents today. It deserves to be in Unicode, or Unicode fails its goal of being able to encode any textual document.
Bitcoin may be short lived, but documents talking about it will not. Linear B isn't exactly widely used in new documents today, but it's still useful to have glyphs for it so that anthropologists can use them in documents about it.
As other people said, blockchain technology will outlive bitcoin. In this case I am saying "short" to mean a decade or so. I expect Unicode to last much longer.
I really don't understand the downvoting of an honest question. I am curious if he knows something I don't, or maybe thinks "short" is something interesting like days or millions of years.
It is great to see Unicode being able to encode almost every symbol people can think of; however, I am still struggling to make them appear on my screen. Is there a good font with great Unicode coverage? Many times there are clever uses of Unicode, yet I can only see empty rectangles.
Keep in mind that what you want may not be one font that covers lots of glyphs -- that makes the font take up lots more memory and take longer to load. And you definitely wouldn't want to use a high-coverage Unicode font as a dynamically-loaded Web font.
Operating systems are fine at understanding that different fonts are necessary for different glyphs, so what's better in a lot of cases is to have a family of fonts that together cover all the glyphs you need. That's what Google Noto [1] is doing.
Symbola is a good font for covering a lot of symbols, while not representing many text characters (on the assumption that you already have fonts you prefer for text).
That said, there's a justification for having a few of the fonts on that chart, like Lucida Sans Unicode and Arial Unicode MS, because they guarantee consistency without you having to install a huge font family. GNU Unifont is also interesting in a hackery kind of way, in that it achieves good coverage by using only pixelly bitmaps.
But on the other hand, Code2000 is an awful font. It eats gobs of memory and it looks bad. Don't use it just because it has a lot of glyphs.
GNU Unifont is just a fallback font, which I think is what the parent really needs, since they're most concerned about seeing the symbols at all and probably don't care about consistent appearance with their main font.
Unicode does not dictate how glyphs are presented. It just describes and categorizes them.
So how they look comes from the font that is used. When the proposal was written these fonts probably didn't exist yet, so the sample was likely just a (slightly sloppy) Photoshop mock-up.
That's a good point, and I should have clarified: I'm referring to the full stars (not the half-stars in the new proposal). Not a Unicode issue, but definitely something I've seen at least on macOS machines.
We need to hold the line somewhere. Preferably before corporate logos get into Unicode. I've seen Facebook and Twitter icons as Unicode characters in the user-definable space. This currently requires a downloaded font, but there's probably some lobbyist somewhere trying to get them into Unicode.
It's getting really complicated. There are now skin-tone modifiers for emoji.
Unicode is turning into a few useful characters amid a sea of junk. This will continue as long as people acquire status by getting "their" symbol(s) into Unicode. I don't see any way this can change.
How are Windows and Java, which are somewhat tied to 16-bit Unicode, handling this? It used to be that the astral planes didn't matter much, but now they do.
That's what surrogate pairs are for. [1] You're no longer working with one code point per character, but even with 32-bit Unicode there's no real guarantee of that (consider things like combining characters, accents, emoji skin tones, etc.)
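As a rough illustration of how a single astral-plane code point turns into two UTF-16 code units (a sketch in Python, with U+1F600 as an arbitrary example):

    # U+1F600 lies above the Basic Multilingual Plane, so UTF-16 encodes
    # it as a surrogate pair: two 16-bit code units.
    ch = "\U0001F600"
    units = ch.encode("utf-16-be")
    high = int.from_bytes(units[:2], "big")
    low = int.from_bytes(units[2:], "big")
    print(hex(high), hex(low))  # 0xd83d 0xde00

    # The underlying arithmetic: subtract 0x10000, then split the
    # remaining 20 bits into two 10-bit halves.
    offset = ord(ch) - 0x10000
    assert high == 0xD800 + (offset >> 10)
    assert low == 0xDC00 + (offset & 0x3FF)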
Unicode is 21 bits wide, and there's lots of space left. Heck, emoji still make up very little of the total encoded characters compared to “normal” human writing systems. (And I'd argue that emoji are by now a normal addition to writing, considering how many people use them daily and are glad to have them interoperable across different platforms, carriers, and devices. Something that wasn't the case previously.)
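The arithmetic behind "21 bits wide", for anyone counting (a quick check in Python):

    # The code space ends at U+10FFFF: 17 planes of 65,536 code points.
    MAX_CODE_POINT = 0x10FFFF
    print(MAX_CODE_POINT.bit_length())       # 21
    print(MAX_CODE_POINT + 1 == 17 * 2**16)  # True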
Unicode Technical Report #51, which is where Emoji are laid out, talks a bit about the current thinking of the committees on this:
> The longer-term goal for implementations should be to support embedded graphics, in addition to the emoji characters. Embedded graphics allow arbitrary emoji symbols, and are not dependent on additional Unicode encoding. Some examples of this are found in Skype and LINE—see the emoji press page for more examples.
> However, to be as effective and simple to use as emoji characters, a full solution requires significant infrastructure changes to allow simple, reliable input and transport of images (stickers) in texting, chat, mobile phones, email programs, virtual and mobile keyboards, and so on. (Even so, such images will never interchange in environments that only support plain text, such as email addresses.) Until that time, many implementations will need to use Unicode emoji instead.
I simply cannot wrap my head around the direction of the Unicode discourse. We're discussing the appropriate code point for different smiley faces, obscure electrical symbols[0], or, in the present case, half stars to express film or book ratings, yet we have no complete set of sub- and superscripts!

Am I mistaken in thinking it odd that there's a complete Klingon alphabet but no representation whatsoever for most Greek or Latin subscripts? Or what if, heaven forbid, I'd want to use a 'b' index/subscript? Tough! Not even the "phonetic extensions" block, where subscript-i comes from, provides it.

Surely there are one or two actual scientists on the Unicode consortium? Or at least one odd soul still sporting a notion of consistency, who finds it only logical to provide a "subscript b" if there's a "subscript a"?
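The complaint is easy to check against the character database; a quick sketch with Python's unicodedata, surveying which Latin letters have an encoded subscript form today:

    import unicodedata

    # Character names follow the Unicode database, e.g. U+2090 is
    # "LATIN SUBSCRIPT SMALL LETTER A"; lookup() raises KeyError for
    # names that don't exist.
    for c in "abcdefghijklmnopqrstuvwxyz":
        try:
            ch = unicodedata.lookup(f"LATIN SUBSCRIPT SMALL LETTER {c.upper()}")
            print(c, "->", ch)
        except KeyError:
            pass  # no subscript form encoded -- 'b' is among the missing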
Unicode is not known for its consistency in dealing with these issues. The original idea behind Unicode was to be able to represent every then-extant character set with perfect fidelity (i.e., go from X to Unicode and back, and you should get the same data). Why are there letters like U+212B Angstrom sign (not to be confused with U+00C5 Latin capital A with ring above) or things like half-width and full-width characters? Because they were present in Shift-JIS, not because of any coherent notion of what constitutes a glyph. Han unification was driven more by the need to keep from blowing a space budget than by actual rationalization of whether or not the scripts deserved separate spaces.
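Those compatibility artifacts are visible in the normalization tables; a small sketch in Python:

    import unicodedata

    # U+212B ANGSTROM SIGN exists only for round-trip fidelity with
    # legacy encodings; canonical normalization folds it into U+00C5.
    print(unicodedata.normalize("NFC", "\u212B") == "\u00C5")  # True

    # The half-width/full-width forms inherited from Shift-JIS decompose
    # under compatibility (NFKC) normalization:
    print(unicodedata.normalize("NFKC", "\uFF21"))  # 'A' (from FULLWIDTH A)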
Note that Klingon isn't in Unicode (it was explicitly rejected by the UTC, with a vote of 9 in favor of the rejection proposal, 0 against it, and 1 abstaining). Tengwar and Cirth, though, are actually considered serious proposals for Unicode, just really, really low priority compared to, say, Mayan script (for which the first proposal should be going live in 2017). Mayan script is interesting in its own right because it's the script (well, of the ones I'm aware of) that most challenges normal conventions on what constitutes letters and glyphs.
ISTM a great deal of trouble and complication could have been prevented by three special types of NBSP that meant "sub", "super", and "back to normal". It's true that some glyphs will be special-cased by some fonts, but in general the glyph is just shrunk and translated when sub- or super-scripted.
I disagree. In math there can be super-super-superscripts, as with tetration representations (https://en.wikipedia.org/wiki/Tetration). Does each level get its own character, and when does it end?
In science, consider an isotope like tantalum-180m, conventionally written with the mass number "180m" as a superscript and the atomic number "73" as a subscript, both stacked to the left of the "Ta" symbol. This cannot be represented as a sequence of superscript and subscript characters, because the "180m" and the "73" would then render side by side (each taking its own horizontal space before or after the "Ta") instead of stacked. Wikipedia represents it correctly with markup.
In addition, pretty much anything can go in superscripts, including 2^א and integral equations. The most general solution is to have a "start superscript" and "end superscript" marker, with the ability to embed superscripts, but that still doesn't solve the isotope representation problem.
> The most general solution is to have a "start superscript" and "end superscript" marker, with the ability to embed superscripts, but that still doesn't solve the isotope representation problem.
Couldn't one have something like a "start zero-width superscript" marker, so that the following subscript would not be offset?
> Couldn't one have something like a "start zero-width superscript" marker, so that the following subscript would not be offset?
Well, the problem is that the subscript and superscript are both aligned with the following regular text. So for the isotope representation you really need a "start right-aligned zero-width superscript" marker and a "start right-aligned zero-width subscript" marker (though "zero-width" isn't exactly right, since they should have width; it's just that only the wider of the super- and subscript in a pair should be used in spacing the text). There might be other notation that also needs left-aligned versions, plus generic start/end superscript markers that have normal width flow, plus appropriate end markers.
It's not surprising that an offhand suggestion doesn't magically solve all problems, but I appreciate your taking the time carefully to explain what's missing. Thanks!
I have to disagree. All but 3 of those pictographs are already in the Unicode standard. You have to patch fonts A) because your preferred font may not have them, and B) to make certain that the font meets Powerline's expectations.
The ones that are "unique" are a bit annoying because they replace defined characters in the Basic Multilingual Plane's Private Use Area (U+E000–U+F8FF). Even though the area is "Private Use", it is often already populated by your OS's system font. There are the Supplementary Private Use Areas A (U+F0000–U+FFFFD) and B (U+100000–U+10FFFD), which can be overwritten safely.
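A tiny helper (hypothetical, purely for illustration) makes the ranges concrete:

    def private_use_area(cp):
        """Return which Private Use Area a code point falls in, if any."""
        if 0xE000 <= cp <= 0xF8FF:
            return "BMP Private Use Area"
        if 0xF0000 <= cp <= 0xFFFFD:
            return "Supplementary Private Use Area-A"
        if 0x100000 <= cp <= 0x10FFFD:
            return "Supplementary Private Use Area-B"
        return None

    print(private_use_area(0xE0A0))  # BMP PUA -- the range Powerline's symbols occupy
    print(private_use_area(0x41))    # None -- an ordinary assigned code point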
I scare-quote "unique" because two of those characters are full-height arrows, one right-pointing, the other left-pointing. These are already defined as U+1F780 (🞀) and U+1F782 (🞂). It may be the case that in some fonts the triangles either A) don't actually go from floor to ceiling, or B) have empty space behind their hypotenuse.
The only truly unique character is the "git branch" pictograph. Maybe someone could write up a convincing argument to include it, but I can't imagine one. It's not a symbol you see too often even in the git community. And I would bet that if you looked hard enough, there's some mathematical symbol that would be suitable.
Just FYI, I've used powerline fonts daily for the past ~3 years.
That's great but what we really need (ahem- what I really need) is more maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters: ⁱⁿₙᵢ and so on.
I can never find a lower-case Greek subscripted α or β when I need one...
> That's great but what we really need (ahem- what I really need) is more maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters: ⁱⁿₙᵢ and so on.
Agreed, but what we need even more than the symbols is some ((La)TeXy, says the mathematician) way of combining them. For example (says the mathematician who doesn't understand the complexity of text encodings), why do we need a whole bunch of separate "subscript m", "subscript n", etc., glyphs, rather than just one "subscript" combining mark?
Unicode is a brilliant idea, but it went off the rails with combining characters, especially when there is both a code point for a character and a combining set of characters that semantically are the same thing.
How would you solve things without combining characters? Especially the case where you can have multiple diacritics on a letter. Encode every single combination of all of them? Seems a bit wasteful, don't you think?
Precomposed characters exist because they existed in other encodings previously and encoding such characters has been one of the core principles of Unicode to ensure an easy upgrade path. Heck, we inherited box drawing characters that way, which I think are more questionable than combining diacritics.
The precomposed characters only exist for compatibility with existing character sets and encodings. If you don't want to deal with them in your code, just normalize to NFD and they're gone. If Unicode didn't care about compatibility to legacy character sets at all, adoption would have been very different, I guess. By now it's probably a moot point since not supporting Unicode is foolish at best, but in the early 90s things were very different.
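Concretely, normalization is a one-liner in most languages; a Python sketch:

    import unicodedata

    # "é" can be one precomposed code point or 'e' plus a combining accent.
    precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True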
As for diacritics, it depends on what you care about for precomposing them. Actual usage in scripts currently in use? Then it's only a handful, and the worst cases are probably Vietnamese and Ancient Greek, which have a bunch of characters with more than one diacritic.
However, the current system with composable diacritics gives you plenty of flexibility: Need a character with a diacritic that isn't used in any language currently? Just compose them and you got it. Font support may be spotty (note that Unicode and font support are completely separate things – bashing Unicode for bad fonts is a fairly useless endeavour), but at least you can represent that grapheme in text without resorting to embedding images, or overlaying glyphs by other means (cf. TeX). Those options are also not interoperable with any other applications.
It also means that if some language now develops a script based on, say, Latin, and invents a new diacritic that can go on different vowels, you'd only have to encode a single new code point, not five or six of them. It scales far better and also isn't tied to any specific writing system. I can use ´ on a or on ω and it works the same.
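For instance (a sketch; whether it renders nicely depends entirely on the font, not on Unicode):

    # The same combining mark works on any base letter, across scripts.
    print("q\u0301")       # q + COMBINING ACUTE ACCENT -- no precomposed form exists
    print("\u03C9\u0301")  # ω + the same accent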
And could you elaborate on how “nearly every Unicode program handles them wrongly”? I'd argue that most programs coming into contact with Unicode do little more than passing it along without caring about the contents at all. And trying to shoehorn human language into something an average programmer can handle without error is likely impossible. Language is complex, writing is complex; Unicode is complex as a result of that. This doesn't only apply to text, mind you, there are lots of things that are complex and are often implemented naïvely or wrongly by programmers who don't know any better. That usually means that programs are broken, and many programmers should know better. Not that we should try adjusting the world to broken programs.
> And could you elaborate on how “nearly every Unicode program handles them wrongly”?
A good chunk don't handle surrogate pairs correctly (or aren't even aware of them); the rest get tripped up by the combining-character issue. Even for those who understand it, there are no clear answers: should a combining character compare equal to a precomposed one? And of course there are three levels of UCS support.
The whole existence of an unnormalized form is a gigantic mistake that could have been easily avoided - simply make the unnormalized form an illegal sequence to begin with.
Unicode programming hasn't gotten as bad as timezone programming yet, but it is well on its way :-(
The other day I was searching for the words for bronze in Tibetan, for research on possible etymologies of some Tibeto-Burman phonetic transliterations into Middle Chinese.[0] (As you do.) Anyway, I found some low-resolution entries in scanned dictionaries online, without romanization, but was unable to translate these to codepoints to obtain a phonetic approximation, even after using online keyboards, due to the hassles of combining characters. I have studied a lot of abugidas (Tai/Lao/Khmer/etc.), so I am not exactly coming at the problem from scratch, either. I am also rather shocked that the Tibetan community hasn't managed to put a decent dictionary online yet.
Ah, I see. Something like ◔ "CIRCLE WITH UPPER RIGHT QUADRANT BLACK".
Someone requested something similar here [1], and someone else made it using CSS here [2]. As the article explains though, it would need to be used in text for the Unicode committee to accept it.
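If you want to find such characters by name, Python's unicodedata can do the lookup (a small sketch):

    import unicodedata

    ch = unicodedata.lookup("CIRCLE WITH UPPER RIGHT QUADRANT BLACK")
    print(ch, hex(ord(ch)))  # ◔ 0x25d4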
[1] https://en.wikipedia.org/wiki/Right-to-left_mark
[2] https://www.explainxkcd.com/wiki/index.php/1137:_RTL
[3] http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt