How I added 6 characters to Unicode (and you can too) (righto.com)
129 points by deafcalculus on Oct 4, 2016 | 65 comments



The rationale given for including mirrored half-stars as separate codepoints is right-to-left languages. I wondered why this was needed, since Unicode already has a right-to-left mark (RLM)[1].

I found the answer in a comment on "Explain XKCD".[2] The RLM usually only reorders characters, but does not mirror their glyphs. The exceptions are glyphs with the "Bidi_Mirrored=Yes" property, which are mapped to a mirrored codepoint.[3]

The half-stars proposal includes a note on that property: "Existing stars are in the “Other Neutrals” class, so half stars should probably use the ON bidirectional class. The half stars have the obvious mirrored counterparts, so they can be Bidi mirrored. However, similar characters such as LEFT HALF BLACK CIRCLE are not marked as mirrored. I'll leave it up to the Unicode experts to determine if Bidi Mirrored would be appropriate or not."
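
For what it's worth, Python's unicodedata module exposes that property, so you can check any character yourself. A quick sketch:

    import unicodedata
    # Bidi_Mirrored=Yes characters report 1 here; the half stars would
    # need this property set for RTL glyph mirroring to apply.
    print(unicodedata.mirrored('('))       # 1: parentheses mirror in RTL text
    print(unicodedata.mirrored('\u2605'))  # 0: U+2605 BLACK STAR does not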

[1] https://en.wikipedia.org/wiki/Right-to-left_mark

[2] https://www.explainxkcd.com/wiki/index.php/1137:_RTL

[3] http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt


I also enjoyed this article about adding the electrical on, off, sleep, and standby symbols to Unicode, and the Hacker News discussion of it. http://unicodepowersymbol.com/we-did-it-how-a-comment-on-hac... https://news.ycombinator.com/item?id=11958682


The one I'm surprised about is not the stars, but actually the bitcoin character. It's just a form of branding to me, and while I think there are interesting uses for blockchain technology, public interest seems to be a bit inflated. Plus, blockchain tech will likely outlive bitcoin itself.


It's not like there is some central Bitcoin company, so what is the brand? Brands are generally owned by companies and are intellectual property in the eyes of governments.


Whatever it is, it's likely to be short-lived and therefore a questionable addition to Unicode.


Unicode doesn't just contain things that are in use now and will be in use forever. It contains characters that were in use in one computer system once, characters for dead languages, abstract symbols that the next generation will barely understand, and more.

The Bitcoin symbol is used in textual documents today. It deserves to be in Unicode, or Unicode fails its goal of being able to encode any textual document.


Yes, there are already symbols for other dead currencies in there, for example.


Bitcoin may be short lived, but documents talking about it will not. Linear B isn't exactly widely used in new documents today, but it's still useful to have glyphs for it so that anthropologists can use them in documents about it.


That is a good point.


Why do you think it will be short-lived, and on what scale is it short?


As other people said, blockchain technology will outlive bitcoin. In this case I am saying "short" to mean a decade or so. I expect Unicode to last much longer.


I really don't understand the downvoting of an honest question. I am curious if he knows something I don't, or maybe thinks short is something interesting like days or millions of years.


It's a currency. I bet that Unicode has glyphs for even more obscure currencies.


It is great to see Unicode being able to encode almost every symbol people can think of; however, I am still struggling to make them appear on my screen. Is there a good font that has great coverage of Unicode? Many times there are clever uses of Unicode, yet I can only see empty rectangles.



Keep in mind that what you want may not be one font that covers lots of glyphs -- that makes the font take up lots more memory and take longer to load. And you definitely wouldn't want to use a high-coverage Unicode font as a dynamically-loaded Web font.

Operating systems are fine at understanding that different fonts are necessary for different glyphs, so what's better in a lot of cases is to have a family of fonts that together cover all the glyphs you need. That's what Google Noto [1] is doing.

[1] https://www.google.com/get/noto/

Symbola is a good font for covering a lot of symbols, while not representing many text characters (on the assumption that you already have fonts you prefer for text).

That said, there's a justification for having a few of the fonts on that chart, like Lucida Sans Unicode and Arial Unicode MS, because they guarantee consistency without you having to install a huge font family. GNU Unifont is also interesting in a hackery kind of way, in that it achieves good coverage by using only pixelly bitmaps.

But on the other hand, Code2000 is an awful font. It eats gobs of memory and it looks bad. Don't use it just because it has a lot of glyphs.


GNU Unifont is just a fallback font, which I think is what the parent really needs, since they're most concerned with seeing the symbol at all, and I doubt they care about consistent appearance with their main font.

https://en.wikipedia.org/wiki/Fallback_font


Symbola is a good one: http://users.teilar.gr/~g1951d/


I love this – but does it bother anyone else that the outlined and filled stars have different sizes? What's the reason behind that?

HN strips the characters out from comments, but they're displayed at the beginning of the article.


Unicode does not dictate how glyphs are presented. It just describes and categorizes them.

So how they look comes from the font in use. When the proposal was written these fonts probably didn't exist yet, so the sample was likely just a (slightly sloppy) Photoshop job.


That's a good point, and I should have clarified, I'm referring to the full stars (not half-stars in the new proposal). Not a Unicode issue, but definitely something I've seen at least on macOS machines.


Wouldn't that depend on the font? They appear the same size in my browser (Firefox), in 15px Arial on Windows 7.


So glad the unicodepowersymbol.com stuff was helpful! We had a lot of fun getting the proposal together.

If anyone wants to submit some new characters, all of our documents are on GitHub https://github.com/jloughry/Unicode


We need to hold the line somewhere. Preferably before corporate logos get into Unicode. I've seen Facebook and Twitter icons as Unicode characters in the user-definable space. This currently requires a downloaded font, but there's probably some lobbyist somewhere trying to get them into Unicode.

It's getting really complicated. There are now skin-tone modifiers for emoji.


Unicode is turning into a few useful characters amid a sea of junk. This will continue as long as people acquire status by getting "their" symbol(s) into Unicode. I don't see any way this can change.


How are Windows and Java, which are somewhat tied to 16-bit Unicode, handling this? It used to be that the astral planes didn't matter much, but now they do.


That's what surrogate pairs are for. [1] You're no longer working with one code point per character, but even with 32-bit Unicode there's no real guarantee of that (consider things like combining characters, accents, emoji skin tones, etc.)

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/dd3...
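
To make the mechanics concrete, here's a quick Python sketch (any UTF-16 producer shows the same thing): a code point above U+FFFF is split into a high/low surrogate pair.

    # U+1F600 doesn't fit in 16 bits; UTF-16 encodes it as two code units.
    ch = '\U0001F600'
    print(ch.encode('utf-16-be').hex())  # 'd83dde00' -> U+D83D, U+DE00
    # The arithmetic behind the pair:
    cp = ord(ch) - 0x10000
    print(hex(0xD800 + (cp >> 10)), hex(0xDC00 + (cp & 0x3FF)))  # 0xd83d 0xde00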


Soon 20 bits won't be enough, either, and every Unicode program out there will break :-(


Unicode is 21 bits wide. And there's lots of space left. Heck, Emoji still make up very little of the total encoded characters, compared to “normal” human writing systems. (And I'd argue that emoji are by now a normal addition to writing, considering how many people use them daily and can be glad to have them interoperable across different platforms, carriers, and devices. Something that hasn't been that way previously.)
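
The arithmetic, for the curious (a trivial Python check):

    # 17 planes of 65,536 code points each, minus the 2,048 surrogates:
    total  = 17 * 0x10000               # 1,114,112 code points
    usable = total - 2048               # 1,112,064 scalar values
    print((total - 1).bit_length())     # 21 bits to address U+10FFFF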


Logos can never be encoded because of trademark concerns. So you're safe; there won't ever be a Facebook or Twitter code point.

Skin tone modifiers work pretty much like diacritics already do. It's not complicated and most of the support relies on the font anyway.


Perhaps we should have an escape code for SVG in Unicode, so we can describe any missing character.


Unicode Technical Report #51, which is where Emoji are laid out, talks a bit about the current thinking of the committees on this:

> The longer-term goal for implementations should be to support embedded graphics, in addition to the emoji characters. Embedded graphics allow arbitrary emoji symbols, and are not dependent on additional Unicode encoding. Some examples of this are found in Skype and LINE—see the emoji press page for more examples.

> However, to be as effective and simple to use as emoji characters, a full solution requires significant infrastructure changes to allow simple, reliable input and transport of images (stickers) in texting, chat, mobile phones, email programs, virtual and mobile keyboards, and so on. (Even so, such images will never interchange in environments that only support plain text, such as email addresses.) Until that time, many implementations will need to use Unicode emoji instead.

[1] http://unicode.org/reports/tr51/#Longer_Term


I simply cannot wrap my head around the direction of the Unicode discourse.

We're discussing the appropriate code-point for different smiley faces, obscure electrical symbols[0] or, in the present case, half stars to express film or book ratings, yet we have no complete set of sub- and superscripts!

Am I mistaken in thinking it odd that there's a complete Klingon alphabet but no representation whatsoever for most Greek or Latin subscripts? Or what if, heaven forbid, I'd want to use a 'b' index/subscript? Tough! Not even the "phonetic extensions", where subscript-i comes from, provide it.

Refer to https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc... or look for SUBSCRIPT in http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

Surely there are one or two actual scientists on the Unicode consortium? Or even the one odd soul still sporting a notion of consistency, who finds it only logical to provide a "subscript b" if there's a "subscript a"?

How am I wrong?

[0] https://news.ycombinator.com/item?id=11958682


Unicode is not known for its consistency in dealing with these issues. The original idea behind Unicode was to be able to represent every then-extant character set with perfect fidelity (i.e., go from X to Unicode and back, and you should get the same data). Why are there letters like U+212B Angstrom sign (not to be confused with U+00C5 Latin capital A with ring above) or things like half-width and full-width characters? Because they were present in Shift-JIS, not because of any coherent notion of what constitutes a glyph. Han unification was driven more by the need to keep from blowing a space budget than by actual rationalization of whether or not the scripts deserved separate spaces.
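
The Angstrom duplication is easy to demonstrate; normalization folds the round-trip duplicate into the ordinary letter (Python sketch):

    import unicodedata
    # U+212B exists only for round-trip fidelity with legacy East Asian
    # encodings; it canonically decomposes to the ordinary letter.
    print(unicodedata.normalize('NFC', '\u212B') == '\u00C5')  # True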

Note that Klingon isn't in Unicode (it was explicitly rejected by the UTC, with a vote of 9 in favor of the rejection proposal, 0 against it, and 1 abstaining). Tengwar and Cirth, though, are actually considered serious proposals for Unicode, just really, really low priority compared to, say, Mayan script (for which the first proposal should be going live in 2017). Mayan script is interesting in its own right because it's the script (well, of the ones I'm aware of) that most challenges normal conventions on what constitutes letters and glyphs.


ISTM a great deal of trouble and complication could have been prevented by three special types of NBSP that meant "sub", "super", and "back to normal". It's true that some glyphs will be special-cased by some fonts, but in general the glyph is just shrunk and translated when sub- or super-scripted.


Yes, just like the LRE/LRO/RLE/RLO/PDF/etc characters.


The Klingon alphabet was proposed but rejected.

Subscript letters were proposed as well: http://www.unicode.org/L2/L2011/11208-n4068.pdf but apparently "Not accepted: Because this has been controversial and is not directly related to repertoire under ballot, it is not appropriate to add it to Amd1 but may be considered for a future amendment" http://www.unicode.org/L2/L2012/12130-n4239.pdf

Looks like here's a recent draft for a new proposal: https://github.com/stevengj/subsuper-proposal


For those looking for Klingon, and many more fictitious scripts, there is the "ConScript Unicode Registry" [0], which assigns them code points in the BMP's Private Use Area[1].

[0] https://en.wikipedia.org/wiki/ConScript_Unicode_Registry [1] https://en.wikipedia.org/wiki/Universal_Character_Set_charac...


Super/sub scripts are markup, not characters. There shouldn't be any in Unicode.


I beg to disagree. In science subscripts are part of the symbols, just like diacritics.

Superscripts, on the other hand, are part of math notation, like fractions and square roots.


I disagree. In math there can be super-super-superscripts, as with tetration representations https://en.wikipedia.org/wiki/Tetration . Does each get its own character, and when does it end?

In science, consider an isotope like

   180m
       Ta
    73
This cannot be represented as a sequence of symbols because that would give:

      180m           180m
          Ta   -or-        Ta
    73                   73
Markup is how Wikipedia represents it correctly, as:

    <span style="display:inline-block;margin-bottom:-0.3em;
    vertical-align:-0.4em;line-height:1.0em;font-size:80%;text-align:right">180m<br>
    73</span>
How would you do it without markup?

In addition, pretty much anything can go in superscripts, including 2^א and integral equations. The most general solution is to have a "start superscript" and "end superscript" marker, with the ability to embed superscripts, but that still doesn't solve the isotope representation problem.


> The most general solution is to have a "start superscript" and "end superscript" marker, with the ability to embed superscripts, but that still doesn't solve the isotope representation problem.

Couldn't one have something like a "start zero-width superscript" marker, so that the following subscript would not be offset?


> Couldn't one have something like a "start zero-width superscript" marker, so that the following subscript would not be offset?

Well, the problem is that the subscript and superscript are both aligned with the following regular text, so for the isotope representation you really need a "start right-aligned zero-width superscript" marker and a "start right-aligned zero-width subscript" marker (though zero-width isn't exactly right, since they should have width; it's just that only the wider of the super- and sub-script in a pair should be used in spacing the text). There might be other notation that also needs left-aligned versions, plus you'd want generic start/end superscript markers that have normal width flow, plus appropriate end markers.


It's not surprising that an offhand suggestion doesn't magically solve all problems, but I appreciate your taking the time carefully to explain what's missing. Thanks!


It would be cool to see the Powerline symbols added to Unicode. The necessary user base should already be there.

See: https://github.com/powerline/fonts/blob/master/README.rst

A zsh theme with those characters in use: https://gist.github.com/agnoster/3712874


I have to disagree. All but 3 of those pictographs are already in the Unicode standard. You have to patch fonts A) because your preferred font may not have them, and B) to make certain that the font meets Powerline's expectations.

The ones that are "unique" are a bit annoying because they replace defined characters in the Basic Multilingual Plane's Private Use section (E000-F8FF). Even though the section is "Private Use", it is often already defined by your OS's system font. There are the Supplemental Private Use Areas A (F0000-FFFFD) and B (100000-10FFFD) which can be overwritten safely.

I scare-quote "unique" because two of those characters are full-height triangles; one right-pointing, the other left-pointing. These are already defined as U+1F780 (🞀) and U+1F782 (🞂). It may be the case that in some fonts the triangles either A) don't actually go from floor to ceiling, or B) have empty space behind their hypotenuse.

The only truly unique character is the "git branch" pictograph. Maybe someone could write up a convincing argument to include it, but I can't imagine one. It's not a symbol you see too often even in the git community. And I would bet that if you looked hard enough, there's some mathematical symbol that would be suitable.

Just FYI, I've used powerline fonts daily for the past ~3 years.


That's great but what we really need (ahem- what I really need) is more maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters: ⁱⁿₙᵢ and so on.

I can never find a lower-case Greek subscripted α or β when I need one...


> That's great but what we really need (ahem- what I really need) is more maths-y characters, like ∑∏∫∀ and all the sub- and super- scripted letters: ⁱⁿₙᵢ and so on.

Agreed, but what we need even more than the symbols is some ((La)TeXy, says the mathematician) way of combining them. For example (says the mathematician who doesn't understand the complexity of text encodings), why do we need a whole bunch of separate "subscript m", "subscript n", etc., glyphs, rather than just one "subscript" combining mark?


Unicode is a brilliant idea, but it went off the rails with combining characters, especially when there is both a code point for a character and a combining set of characters that semantically are the same thing.


How would you solve things without combining characters? Especially the case where you can have multiple diacritics on a letter. Encode every single combination of all of them? Seems a bit wasteful, don't you think?

Precomposed characters exist because they existed in other encodings previously and encoding such characters has been one of the core principles of Unicode to ensure an easy upgrade path. Heck, we inherited box drawing characters that way, which I think are more questionable than combining diacritics.


At a minimum, I would not have any 2-character graphemes be semantically identical with any single code point.

> Seems a bit wasteful

If they were all separate code points, how many are we talking about?

Also, consider that nearly every Unicode program handles them wrongly. That's pretty wasteful of programmer time and money.


The precomposed characters only exist for compatibility with existing character sets and encodings. If you don't want to deal with them in your code, just normalize to NFD and they're gone. If Unicode didn't care about compatibility to legacy character sets at all, adoption would have been very different, I guess. By now it's probably a moot point since not supporting Unicode is foolish at best, but in the early 90s things were very different.
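
For instance (a Python sketch), after NFD the precomposed form is gone and you only ever see base letters plus combining marks:

    import unicodedata
    s = '\u00E9'                       # precomposed é (U+00E9)
    d = unicodedata.normalize('NFD', s)
    print([hex(ord(c)) for c in d])    # ['0x65', '0x301']: e + combining acute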

As for diacritics, it depends on what you care about for precomposing them. Actual usage for scripts in use currently? Then it's only a handful, and the worst cases are probably Vietnamese or Ancient Greek, which have a bunch of characters with more than one diacritic.

However, the current system with composable diacritics gives you plenty of flexibility: Need a character with a diacritic that isn't used in any language currently? Just compose them and you got it. Font support may be spotty (note that Unicode and font support are completely separate things – bashing Unicode for bad fonts is a fairly useless endeavour), but at least you can represent that grapheme in text without resorting to embedding images, or overlaying glyphs by other means (cf. TeX). Those options are also not interoperable with any other applications.

It also means that if some language now develops a script based on, say, Latin, and invents a new diacritic that can go on different vowels, you'd only have to encode a single new code point, not five or six of them. It scales far better and also isn't tied to any specific writing system. I can use ´ on a or on ω and it works the same.

And could you elaborate on how “nearly every Unicode program handles them wrongly”? I'd argue that most programs coming into contact with Unicode do little more than passing it along without caring about the contents at all. And trying to shoehorn human language into something an average programmer can handle without error is likely impossible. Language is complex, writing is complex; Unicode is complex as a result of that. This doesn't only apply to text, mind you, there are lots of things that are complex and are often implemented naïvely or wrongly by programmers who don't know any better. That usually means that programs are broken, and many programmers should know better. Not that we should try adjusting the world to broken programs.


> And could you elaborate on how “nearly every Unicode program handles them wrongly”?

A good chunk don't handle surrogate pairs correctly (or aren't even aware of them); the rest get tripped up by the combining character issue. Even for those who understand it, there are no clear answers: "should a combining character sequence compare equal to the precomposed one?" And of course there are 3 levels of UCS support.
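
In concrete terms (Python sketch), naive equality sees two different strings unless both sides are normalized first, which is exactly the trap most code falls into:

    import unicodedata
    a, b = '\u00E9', 'e\u0301'    # precomposed vs. combining é
    print(a == b)                 # False: the code point sequences differ
    print(unicodedata.normalize('NFC', a) ==
          unicodedata.normalize('NFC', b))  # True: canonically equivalent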

The whole existence of an unnormalized form is a gigantic mistake that could have been easily avoided - simply make the unnormalized form an illegal sequence to begin with.

Unicode programming hasn't gotten as bad yet as timezone programming, but they are well on their way :-(


Unicode is a fuckton of backwards compatibility. That’s the big reason those things exist.


The other day I was searching for the words for bronze in Tibetan, for research on possible etymologies of some Tibeto-Burman phonetic transliterations into Middle Chinese.[0] (As you do.) Anyway, I found some low-resolution entries in scanned dictionaries online without romanization, but was unable to translate these to codepoints to obtain a phonetic approximation, even after using online keyboards, due to the hassles of combining characters. I have studied a lot of abugidas (Tai/Lao/Khmer/etc.), so I'm not exactly coming at the problem from scratch, either. I'm also rather shocked that the Tibetan community hasn't managed to put a decent dictionary online yet.

[0] https://en.wikisource.org/wiki/Translation:Manshu/Chapter_7#...


What about 1/4, 3/4, 1/5, etc...?


¼, ¾, ⅕.

For etc, start here: http://unicode-search.net/unicode-namesearch.pl?term=fractio...

You can use "fraction slash" to make any fraction, using super/subscript numbers: ⁷⁄₃₃
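
Generating those is mechanical. A Python sketch (the fraction helper is just illustrative, not any standard API):

    # Map ASCII digits to the super/subscript code points and join
    # with U+2044 FRACTION SLASH.
    SUP = str.maketrans('0123456789',
                        '\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079')
    SUB = str.maketrans('0123456789',
                        '\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089')

    def fraction(num, den):
        return str(num).translate(SUP) + '\u2044' + str(den).translate(SUB)

    print(fraction(7, 33))  # prints ⁷⁄₃₃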


I thought they were talking about fractions too, but then I realised they're probably talking about fractions of stars.


Ah, I see. Something like ◔ "CIRCLE WITH UPPER RIGHT QUADRANT BLACK".

Someone requested something similar here [1], and someone else made it using CSS here [2]. As the article explains though, it would need to be used in text for the Unicode committee to accept it.

[1] https://github.com/FortAwesome/Font-Awesome/issues/4147

[2] http://codepen.io/denwo/pen/azjXzL


They mention that situation in the proposal doc http://files.righto.com/files/half-star-unicode.pdf Search for the title "Why just half? What about quarters or 13%?".


Thanks! The answer appears to be: nobody uses them.


I wrote a Firefox add-on to make it easier to input them.

https://addons.mozilla.org/en-US/firefox/addon/unicode-keybo...

\frac{13}{117} → ‌13⁄117‌


Maybe that would be a good case for combining characters: digits + / + digits = fraction, where the / is a combining character and digits is digit(+digit).


The best part is where you swap Andrew West's first name for Adam.


Oops, sorry Andrew! Many apologies! (I watched way too much Batman as a child and "Adam West" is wired into my brain.)



