A Spectre Is Haunting Unicode (2018) (dampfkraft.com)
323 points by polm23 on Oct 31, 2020 | 130 comments



One of my favorite Unicode oddities is in the Cyrillic block. Some books used to use the letter "ꙩ" when writing the word eye (ꙩко). The letter was (from what I can tell) never used for anything else. Because dual is a thing, clever people then went on to write ꙭчи or Ꙫчи to mean two eyes.

Both of those characters made it into Unicode because they saw some use in historic scripts. However, even completely absurd variations made their way into Unicode. The "many eyed seraphim" is written as серафими многоꙮчитїи. So if you need to write something with a lot of eyes, you can use ꙮ.


These four Unicode symbols have a similarly obscure origin: EDIT: automatically removed by HN

https://en.wikipedia.org/wiki/Go_(game)#Notation_and_recordi...

Relevant thread on the Unicode mailing list, with Subject "Purpose of and rationale behind Go Markers U+2686 to U+2689" is here: https://unicode.org/mail-arch/unicode-ml/y2016-m03/thread.ht...


> EDIT: automatically removed by HN

I'm curious about this. Was your post removed for containing certain unicode characters?


Probably not the whole post, just the uncommon characters. A few days ago HN deleted several characters from my comment immediately upon posting.

https://news.ycombinator.com/item?id=24851415


Just tested it, tried a comment with an emoji and it never appeared. So I'd assume so.


I googled up the paperwork proposing the inclusion of ꙮ and variants a few years ago and got the impression that a) you're not supposed to include something that only has a single recorded use and b) ꙮ pretty much has a single recorded use.


Interestingly, the Phaistos disk[1] has its own little bit of Unicode, and from what I understand the only known “original” use of all of the symbols is the Phaistos disk itself. I guess there was enough discussion about the disk and its symbols to include it.

[1] https://en.m.wikipedia.org/wiki/Phaistos_Disc


The Phaistos disk sort of makes sense to me since, single use or not, it's (probably) some sort of script. I'm not really any kind of expert in ancient scripts or Unicodery and this was a while back but the gist I got from the docs was something along the lines of 'no one-time/decorative uses' and the eye thing looked like it was exactly that. The submission was from an academic in Slavic studies of some sort, I thought about emailing them to ask but couldn't really come up with a way to phrase what amounts to some version of 'did you mess up Unicode or what?' in a non-dickish way.


I guess you could argue that a single discovered extant usage is not single use, it's quite possible that there will one day be a reconstruction of the script (maybe with some more samples being discovered).


You mean the eye thing? That's from Cyrillic, a baby of a script (as scripts go) that's still in wide use today. The o-as-eyes is a decorative flourish, a little visual pun - it's still just an o. An analogous thing would be a medieval Latin parchment of, let's say, the Lord's Prayer and it opened with a bigass P with vines, a gargoyle dancing to a cat shredding on the lute and a tiny caricature of the scribe's dad. If someone found that, we probably won't end up with ʟᴀᴛɪɴ ᴅᴀᴅ ᴊᴏᴋᴇ ᴘ in Unicode.


I think wisty was saying you could make the argument that the glyphs on the Phaistos disk were also used elsewhere, but we just don't have any samples.

There's no way to know one way or the other.


Oh, sure, if it's about the Phaistos stuff - it sounds reasonable to have them in Unicode to me but much more importantly, I'm oversimplifying/butchering/misremembering whatever the actual Unicode rules are. You're better off just assuming I'm wrong about their details in important ways.


Well, it seems a suitable one character representation for "big brother".

Maybe it'll become in-vogue over the next few years? ;)


"big brꙮther"


On the other hand ꙮ has found a new use as "ornament". I sometimes spot it in pagination markers etc.


See, that’s the kind of thing that’s absolutely fascinating. Some scribe thought they were being clever a few centuries ago, and the glyph will now live on forever in our modern equivalent of a collective myth known as “the Unicode standard”, with people discovering new uses for it for generations to come.


It's a nice little doodleglyph, no doubt. I never thought I'd be virtually pointing at an internet stranger and being all 'yes, that's also one of my favourite weird things in Unicode!', though.


> One of my favorite Unicode oddities is in the Cyrillic block. Some books used to use the letter "ꙩ" when writing the word eye (ꙩко).

Interesting. That "ꙩко" looks phonetically (if that's the right word, I'm not well up on linguistics (if that's the right word again, ha ha)) a bit like the Hindi word "aankh" for "eye". The "n" sound in "aankh" is emphasized less, for lack of a better term. Actually, in Hindi, it is shown as a dot on top of one of the other letters, to show that.

Also reminded by this, via George Borrow's novel Lavengro[1] (a story about gypsies), that the gypsy and Hindi words for "nose" are similar, "nak".

One theory is that the gypsies (Roma(ni)[2]) migrated from northwestern parts of India to other parts of the world, such as North Africa and Europe.

Aankh (pronounced almost like aak) and nak, get it? :)

[1] https://en.m.wikipedia.org/wiki/Lavengro

[2] https://en.m.wikipedia.org/wiki/Romani_people


Most modern Indo-European language words for 'eye' share a common origin, this predates the Romani and their migrations by a long stretch. Here's an eye:

https://www.etymonline.com/word/*okw-?ref=etymonline_crossre...

Nose is similar:

https://www.etymonline.com/word/*nas-


Romani is indeed much closer to Hindi than to other Indo-European languages, but the words “eye” and “nose” are not really good indicators of that.

Much better indicators are the grammar, the pronunciation, etc.


> The "n" sound in "aankh" is emphasized less, for lack of a better term.

Sometimes an original nasal consonant will reduce to a vowel that remembers the original consonant only by releasing air through the nose. (Where ordinarily the air would come out of the mouth.)

https://en.wikipedia.org/wiki/Nasal_vowel

This is a big thing in Portuguese and French. (At this point, the consonants are long since gone and the nasal vowel is correct Portuguese/French. But the change would have originated in people speaking something closer to Latin, which doesn't use nasal vowels, and being "careless" with their pronunciation.)

Is this what you're talking about?


Elision of -m and nasalisation and/or lengthening of the preceding vowel was already happening in Classical Latin, including the variety spoken by ruling elites, and is attested through poetic meter and other sources. See W. Sydney Allen, Vox Latina, 30–31. Also youtuber ScorpioMartianus has invested some time into training himself into using reconstructed pronunciation and has talked extensively about that, see for example https://youtu.be/psYM-LvBplw


Granted; I spoke much too broadly. According to the classical sources, -m fully disappears when followed by a vowel. (Though same-word intervocalic -m- does not.) Poetic meter backs this claim up robustly.

It's actually a little bit weirder than that; the vowel before -m also disappears. But it's certainly plausible for some nasalization to remain anyway.

> and/or lengthening of the preceding vowel

You're referring to -ns- / -nf-? You're also right there. As far as I'm aware, this doesn't happen for -nd- / -nt-, though.

> Also youtuber ScorpioMartianus has invested some time into training himself into using reconstructed pronunciation and has talked extensively about that

While that sounds like a cool project, I don't think it necessarily has a lot to tell us about the historical pronunciation. I think you could develop a pronunciation system that matched nearly every documented feature of a dead language while failing to match a large number of undocumented features.


> While that sounds like a cool project, I don't think it necessarily has a lot to tell us about the historical pronunciation

You could say the same about pronunciation research published in linguistic journals. Let me use an analogy:

Imagine looking at the source code of a game. It's technically possible for a reader to technically understand what the program is doing and understand what the game is about, how it works, its rules and goals.

However, if you pass the sources through a compiler (whose behaviour you also can well understand) what you end up with is a game you can run and experience.

Reconstructed pronunciations are a bit like that. You get to "experience" rules that are otherwise coded in an abstract language. Translating those rules into something you experience actually requires a lot of effort and expertise. You can in theory become a "compiler" and learn how to do it yourself (aloud or in your head) but it's hard; what's wrong with outsourcing it?


> Let me use an analogy:

> Imagine looking at the source code of a game. It's technically possible for a reader to technically understand what the program is doing

This is already well beyond what's possible for a dead language. It's not even possible for living languages, although in that case we can draw empirical conclusions.

I've been interested for a long time in the question of how we can determine how a language divides up the space of possible sounds. For example, English [θ] (the sound at the beginning of "thick") is perceived by Mandarin speakers as being the sound [s] (as in "sick"). It is perceived by Cantonese speakers as being [f] (as in "fickle").

The sounds [s] and [f] are both phonemic in both Mandarin and Cantonese. But something about the phonology of each pushes the sound [θ] into one category or the other. The choice is not arbitrary; it is quite consistent across speakers of each language.

To the best of my knowledge, we have no way to answer the question "how would language X categorize sound Y?" other than experimentation, which is impossible with a dead language. But it is a fact about the language, and in principle the question can be answered solely by looking at the pronunciation of sounds within the language -- in the ordinary course of events, a Chinese speaker would go their entire life without being exposed to the sound [θ], and yet they would largely agree with each other on what the sound was if they did hear it.

I say that this categorization question draws upon rules of pronunciation which we don't presently have a good idea of how to describe or characterize at all.

So I say reenactment of a dead language is an interesting project, but you're inevitably going to make choices that are wildly different from the language as it existed in the past. Pronunciation reconstruction is on much firmer ground -- and it gets there by not addressing most questions. But a reenactment cannot avoid addressing every possibility, and it's going to get most of them wrong.


> It's not even possible for living languages

YMMV. I once watched a short video by an accent coach teaching how to do an Irish accent, a Scottish accent, an Australian accent etc. He talked about place of articulation and made pretty decent (although clearly not native) approximations of the pronunciations. I found his attempts at actively voicing things out quite helpful. I'm fully aware this is just an approximation, but in a way I found that teacher to be more effective at conveying what makes a given accent peculiar than just listening to a native speaker would be. Probably it all depends on what you're interested in.


Interesting.

>Is this what you're talking about?

I'm not quite sure, since I don't know much about phonetics / linguistics, as I said above.

Something like this, the French sound from near the top of your link above:

https://upload.wikimedia.org/wikipedia/commons/0/0e/Fr-en.og...

But that nasal sound is not exactly the same as the nasal sound in aankh (at least to my untrained ear).

Can't describe it better than that, sorry.


I love this. There are lots of characters in Unicode I'd love to know the history of, but searching for such things is hard, at best.

The one I've stumbled across recently are the "Negative Squared Latin Capital Letter" (1F170 - 1F189 [1]). The 26 letters are there, but "A", "B", "O", and "P" are special. Blood types and a parking symbol. Sure, but why? Why was it decided we needed all of the Latin letters in inverted squares, except those? Why aren't they their own symbol? I'm not complaining here, I just want to know the history.

I'd also love to know the history of the other Latin letter ranges. It feels super odd to me to have, say the "Mathematical Bold Fraktur" range of characters (1D56C - 1D59F [2]). What's the history of including a font in Unicode? Why did we stop with the few that are in there?

This is probably somewhere online, but I can't find it.

[1] https://unicode-search.net/unicode-namesearch.pl?term=negati...

[2] https://unicode-search.net/unicode-namesearch.pl?term=mathem...
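For anyone who wants to poke at the two ranges mentioned above without a web search, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `names_in_range` is made up for this example, and the names printed depend on the Unicode version bundled with your Python build.

```python
import unicodedata

def names_in_range(first: int, last: int) -> list[tuple[str, str]]:
    """Return (code point label, official name) for each assigned code point in [first, last]."""
    rows = []
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), None)  # None for unassigned code points
        if name is not None:
            rows.append((f"U+{cp:04X}", name))
    return rows

# The two ranges quoted in the comment above.
squared = names_in_range(0x1F170, 0x1F189)
fraktur = names_in_range(0x1D56C, 0x1D59F)

# Show the first few entries of each range.
for label, name in squared[:3] + fraktur[:3]:
    print(label, name)
```

This only gives you the official names, not the history, but the names are usually a good search term for finding the original proposal documents.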


There's a popular belief in Japan that a person's blood type influences their personality[0]. This probably made Japanese emoji designers add them to their pre-unicode emoji sets, which then got rolled into the first set of unicode emoji in 2010[1], along with a bunch of other Japanese symbols.

[0] https://en.wikipedia.org/wiki/Blood_type_personality_theory

[1] https://emojipedia.org/unicode-6.0/


>Sure, but why?

Some OSes display some unicode characters as emoji-default. These characters were selected for that for exactly the reason you surmise. There’s a [text presentation selector][] to enforce the text variation. This is also useful for making (yellow triangle with exclamation point inside) become (single color triangle outline with same-color exclamation point inside).

ETA: Ah, OK, here’s a post with pictures demonstrating this: https://stackoverflow.com/questions/48534667/how-to-display-...

[text presentation selector]: http://www.unicode.org/reports/tr51/#def_text_presentation_s...


> Some OSes display some unicode characters as emoji-default.

Those characters have the Emoji_Presentation Unicode property, as listed in https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.t...

If a site wished to exclude garish blobs from comments while permitting textual symbols, that would be the right way to do it.
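A rough sketch of what such a filter could look like, assuming the site has loaded emoji-data.txt as text. The `SAMPLE` string below is a hand-picked illustration in that file's line format, not the real file, and `load_property` and `strip_emoji_default` are hypothetical names invented for this example.

```python
import re

# Illustrative lines in the format used by emoji-data.txt.
# This is a tiny hand-picked sample for the sketch, not the real file.
SAMPLE = """\
231A..231B    ; Emoji_Presentation   # watch..hourglass done
1F170..1F171  ; Emoji                # A button..B button
1F600         ; Emoji_Presentation   # grinning face
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")

def load_property(text: str, prop: str) -> set[int]:
    """Collect every code point that the data text assigns to `prop`."""
    found = set()
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(3) == prop:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            found.update(range(first, last + 1))
    return found

def strip_emoji_default(comment: str, emoji_presentation: set[int]) -> str:
    """Drop characters that default to emoji presentation; keep everything else."""
    return "".join(ch for ch in comment if ord(ch) not in emoji_presentation)

EMOJI_PRESENTATION = load_property(SAMPLE, "Emoji_Presentation")
print(strip_emoji_default("type \U0001F170 \u231A ok", EMOJI_PRESENTATION))
```

With this sample, the watch is dropped while the negative squared A survives, since it is only listed as `Emoji`. A real filter would presumably also have to decide what to do with U+FE0F, which requests emoji presentation for characters that default to text.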


The committee thinks/thought that the Fraktur letters aren’t just a different way to write letters, but symbols with a different meaning, just as ℂ, ℕ, and ℝ aren’t a different way to write C, N and R.

(And yes, they included not only 𝔸𝔹ℂ𝔻𝔼𝔽𝔾ℍ𝕀𝕁𝕂𝕃𝕄ℕ𝕆ℙℚℝ𝕊𝕋𝕌𝕍𝕎𝕏𝕐ℤ, but also 𝕒𝕓𝕔𝕕𝕖𝕗𝕘𝕙𝕚𝕛𝕜𝕝𝕞𝕟𝕠𝕡𝕢𝕣𝕤𝕥𝕦𝕧𝕨𝕩𝕪𝕫 and 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡 in Unicode. In iOS Safari, using the font for entering HN comments, ℂℍℙℚ renders a bit taller and bolder for me. ℝ and ℤ render taller, but not bolder. Why beats me.)

I agree with them on ℂ, ℕ, and ℝ. I also think that makes it hard to disagree with them on the Fraktur letters.


These can result in crossness amongst mathematicians: everyone uses them on the blackboard (they are called blackboard-bold after all) but half use them in printed matter and the other half insist on actual bold. Discussions on the matter lead to cold stares and the poking of fingers in the chest.


CHNPRQ and Z are the ones with actual canonical meaning:

- Complex

- Quaternion (H for Hamilton)

- Natural

- Prime

- Rational (Q for quotient)

- Real

- Zinteger

just kidding. i think Z stands for something in German.


I have also seen

- ℙ used to denote a probability distribution

- 𝔼[X] as the expected value of a random variable X

- 𝟙_S as an indicator function for a set S (which is denoted as a subscript)

- 𝕕 as the differential operator, denoting which variable to differentiate or integrate over: 𝕕f/𝕕x or ∫ f(x) 𝕕x


I've also seen 𝔸 for affine space, 𝔹 for the Booleans, 𝔽 for a field, 𝕂 for a field, 𝕆 for octonions and 𝕊 for surreals.

I think I've seen 𝕀 and 𝕚 for identity matrices and suchlike, and I'm sure 𝟘 also gets some use.


Z for Zahlen, "numbers". "Zinteger" is an improvement, being more precise.


N are the positive integers including zero, Z are all integers.


This is the math equivalent of tabs vs spaces


Isn't N just the positive integers and W the one with zero?


No, the positive integers are Z⁺ (Z^+ / Z<sup>+</sup>). N is natural numbers aka nonnegative integers aka positive integers along with (not "including", since it isn't positive) zero.


What bothers me is that for some reason only some latin letters and numbers are available as subscripts and superscripts...


Most of the Latin subscript glyphs are from the Phonetic Extensions block. They only added those that were actually used in various phonetic notations.

This Wikipedia page has a good list of why various ones were added, in addition to those from the Subscripts and Superscripts block: https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc...


> The committee thinks/thought that the Fraktur letters aren’t just a different way to write letters, but symbols with a different meaning, just as ℂ, ℕ, and ℝ aren’t a different way to write C, N and R.

As to the mathematical uses, this thought is correct. Niche, obviously, but Unicode has no trouble with niche characters.

It's hard to justify the full alphabet on those lines, though.


Is it though?

There's, imo, a reasonable expectation that if some of those characters were used then some other might be later, and it's easier to have them already in the standard than reserving space for them and then having to backfill it, I suppose.


> In iOS Safari, using the font for entering HN comments, ℂℍℙℚ renders a bit taller and bolder for me. ℝ and ℤ render taller, but not bolder. Why beats me.

Font fallback. HN specifies Verdana, Geneva, sans-serif. Geneva has the six characters mentioned. The rest will be rendered using a different font that covers more parts of Unicode, or that covers Unicode math specifically. With the built-in system fonts, that would be Cambria Math on Windows and one of the STIX fonts on macOS.

(Source: Firefox → Inspect Element → Fonts tab of the right Inspector pane. Chrome can do that too in a slightly different place, macOS Safari is too user-hostile to have this feature.)


I guessed that, but still can’t figure out what makes ℂℍℙℚ special. Looking again, I notice ℕ is taller, too, but not bolder. ℂℍℕℙℚℝℤ is much more “I can see why a font would have these, but not others”. That solves the issue enough for me.

(And, by the way, iOS doesn’t have Geneva, but it has a version of Verdana)


Weird that this comes up today. My Steam handle has an O and I used "negative squared Latin" characters -- I was investigating this yesterday (I found no answers!).

Last year my Steam name showed properly on MS Win10 and on Kubuntu. About 6 months ago Win 10 started showing it with the O as red. Last week the red O doesn't show at all in one part of Steam but shows white in another part.

I copied the symbols to a Unicode decoder and it showed the O name with a {blood red} modifier, something like that.

I searched briefly but couldn't find the symbol without the red colour, but it only shows red in some places: I'd pasted it into the search bar in Firefox (on Kubuntu), it showed as a number-square (1Fxxx) but then after searching it showed as the red-O character.

Utf8 is weird.


In case you enjoy the technical pedantry: UTF-8 is "just" an encoding scheme to represent the characters defined in the Unicode standard. There are other schemes in the standard (like UTF-16 or UCS-4).

Your issue here is with the characters and their representation, not with the specific encoding. Hence, what you wanted to say is: "Unicode is weird" ;)
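To make the distinction concrete, here is a small sketch that takes one code point (the negative squared O from the parent comment) and serializes it with three encoding forms using Python's built-in codecs. The variable names are just for illustration.

```python
ch = "\U0001F17E"  # NEGATIVE SQUARED LATIN CAPITAL LETTER O

# One abstract character, three different byte serializations.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, raw in encodings.items():
    print(f"{name:10} {len(raw)} bytes  {raw.hex(' ')}")

# Every encoding decodes back to the same single code point.
assert all(raw.decode(name) == ch for name, raw in encodings.items())
```

Whether the glyph comes out red, black or as a box is decided later by the renderer and the font, independent of which of these byte forms carried it.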


This is the most nicely delivered pedantry I've ever read, kudos!


Here you go:

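    # U+1F17E NEGATIVE SQUARED LATIN CAPITAL LETTER O
    o = "\U0001F17E"
    black = o + "\ufe0e"  # VS15 (U+FE0E) requests text presentation
    red = o + "\ufe0f"    # VS16 (U+FE0F) requests emoji presentation
    print(black, red)

HN seems to strip the special sequences, so here it is on a pastebin instead: https://paste.debian.net/1169485/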

More details: http://unicode.org/faq/vs.html http://www.unicode.org/Public/emoji/5.0/emoji-variation-sequ...


Kwpolska, most of your comments since 2011 are shadow banned. Occasionally HN readers vouch them back into visibility.


Fraktur is used in mathematics, chiefly when discussing/denoting Lie groups. I've listed some sources but you can probably Google for more.

[0] https://en.wikipedia.org/wiki/Fraktur#After_1941

[1] https://mathoverflow.net/questions/87627/fraktur-symbols-for...


That's fair, but why stop there? The example that comes to mind is "Courier 12pt is the only font ever for screenplays". It's required to convey a screenplay, to my mind, like using Fraktur is required in the math space.

Despite what it may seem like, I'm really not trying to mis-parse the reasons here, I'm honestly trying to figure out where the line is, and why it's there.


> why stop there? The example that comes to mind is "Courier 12pt is the only font ever for screenplays". It's required to convey a screenplay, to my mind, like using Fraktur is required in the math space.

The screenplay is still being written in letters. Mathematical ℝ is more accurately thought of as an ideogram than a letter. If you were to write "let r be a member of ℝ", the "ℝ" would be structurally parallel to the full word "member", not to the "r" within it.

Courier for screenplays is a choice you make at the document level; blackboard bold for mathematical entities is not. ℝ is always ℝ no matter what styles apply to your document.


I really don't know, but my guess would be that in mathematics particular symbols denote particular meanings depending on the shape of / decorations on the character. Big g is different from little g which is different from bold g which is different from italic g which is different from Fraktur g. Because Fraktur g is different in meaning from just g, Fraktur gets a spot in the Unicode specs. Courier, while it is standard for screenplays, does not affect the meaning of the text of the screenplay as opposed to the use of another font.


Historically, a few were included in the Letterlike Symbols block¹ because they were present in some pre-Unicode character set. Much later it was argued that since a few were present, they all should be present.

¹ https://en.wikipedia.org/wiki/Letterlike_Symbols


Also note that these selective ones were part of the basic multilingual plane, where space was always a bit at a premium. They were assigned before Unicode expanded to have 17 complete 16-bit planes and space stopped being a problem.
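The plane arithmetic is easy to check. A minimal sketch, where `plane` is a helper name made up for this example: the early Letterlike Symbols such as ℂ and ℝ sit in plane 0, while the later Mathematical Alphanumeric Symbols such as 𝔸 sit in plane 1.

```python
PLANE_SIZE = 1 << 16   # 65,536 code points per plane
PLANE_COUNT = 17       # planes 0 through 16
assert PLANE_SIZE * PLANE_COUNT == 0x10FFFF + 1  # 1,114,112 code points in total

def plane(ch: str) -> int:
    """Plane number of a single character (0 is the Basic Multilingual Plane)."""
    return ord(ch) >> 16

for ch in "ℂℝ𝔸𝔹":
    print(f"U+{ord(ch):04X} plane {plane(ch)}")
```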


You mix the Fraktur and Roman letters in the same mathematical manuscript, the way you mix Greek and Roman letters in the same mathematical manuscript.


Fraktur is more than a font difference as it has a few ligatures (e.g. tz) that are not found in the Roman alphabet as used in modern German (ß is the only one that made the transition). And these “ligatures” aren’t really true ligatures in the sense of, say, the “fi” ligature in some fonts; they are glyphs that are close to being fully fledged letters, much as ö, which is an accented (umlauted) letter in German, is a fully fledged letter in Swedish, or how W became a freestanding letter in English.


The introduction of the Fraktur font introduces different meaning. 'R' in Fraktur would mean something different than R in another font in the same text.


I think you could argue the same for typewriter (monospace serif) fonts. Plenty of texts use them to denote the name of a variable or function in-line, much as we would use backticks to talk about `leftPad` here.


𝚃𝚑𝚊𝚝'𝚜 𝚊 𝚐𝚘𝚘𝚍 𝚙𝚘𝚒𝚗𝚝. 𝚃𝚑𝚎𝚢 𝚜𝚑𝚘𝚞𝚕𝚍 𝚒𝚗𝚌𝚕𝚞𝚍𝚎 𝚝𝚑𝚘𝚜𝚎 𝚒𝚗 𝚄𝚗𝚒𝚌𝚘𝚍𝚎. 𝙾𝚑.


This is actually what started me down this path a while ago. I had to do a full text search feature for some text from questionable sources, and some users had taken to using the full width characters for emphasis (I think; they clearly had rules in their head for when they'd use it, but I didn't know what the rules were). There are libraries that can handle the official Unicode normalization rules, but users don't exactly always pay attention to the official rules, so I get to start finding all sorts of weird little corners of Unicode.

Though, as I understand it, the full-width characters are there not for any modern use cases, but for historical reasons having to deal with older character sets.

Still interesting.
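For the search case described above, the official normalization rules already fold the fullwidth forms and the mathematical alphabets back to plain letters. A minimal sketch with Python's standard `unicodedata` module, where `search_key` is a hypothetical helper name and NFKC plus case folding is one reasonable choice rather than the only one:

```python
import unicodedata

def search_key(text: str) -> str:
    """Fold compatibility variants (fullwidth, math alphabets) and case for matching."""
    return unicodedata.normalize("NFKC", text).casefold()

# Fullwidth, mathematical monospace, and plain ASCII spellings of the same word.
samples = ["Ｕｎｉｃｏｄｅ", "𝚄𝚗𝚒𝚌𝚘𝚍𝚎", "UNICODE"]
print({search_key(s) for s in samples})
```

Indexing and querying through the same key makes these spellings match each other. It will not catch lookalikes that have no compatibility mapping, which is where the weird little corners start.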


Full-width Latin characters are used to fit in the grid of Chinese/Japanese/Korean characters... they're not going anywhere.

https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms


I don't think there's much call for fullwidth Latin characters for that purpose. Ordinary use means typing with whatever your input method gives you. This is generally not fullwidth characters.

A clean grid would be desirable in formal use, but formal use means trying to avoid Latin characters as much as possible. It's generally possible. Plaques and the like are much more likely to say e.g. 二〇二〇年 than to say 2020年.


Grids are not just for formal use, they're useful any time you want to have aligned text, e.g. if you want to write a markdown table mixing Latin and CJK characters.

And I doubt you'd want to eliminate all formal uses of Latin characters. E.g. a plaque about a person would likely want to use their preferred name, which might be in Latin characters.


I think the full adoption of different alphabet styles as independent unicode glyphs is, overall, a conceptual mistake.

But, note that the identical process, much earlier, is how we got separate capital and lowercase forms. Writing systems never do that when they're developed.


That style of using full width characters for emphasis might be called vaporwave, or at least related to it.

V A P O R W A V E

AESTHETIC

https://en.m.wikipedia.org/wiki/Vaporwave


Spacing was used for emphasis in blackletter, and consequently persisted in roman in German even after other means (e.g. italics) became common in other languages. https://de.wikipedia.org/wiki/Sperrsatz


Welp!


> The one I've stumbled across recently are the "Negative Squared Latin Capital Letter" (1F170 - 1F189 [1]). The 26 letters are there, but "A", "B", "O", and "P" are special.

In which way are they special? As far as Unicode is concerned they are all the same. It's just that some have an emoji rendering variant for the blood type reasons. The rendering can be picked through presentation selectors: http://www.unicode.org/reports/tr51/#def_text_presentation_s...


I don't see what's special about ABOP on that page?


Those four are listed specially as having optional emoji presentation: https://unicode.org/emoji/charts/emoji-variants.html

These all have Emoji_Presentation=No as specified in emoji-data.txt, for precisely this reason (to avoid discrepancies in rendering), but most platforms don't respect these defaults, as it's common to get strings containing emoji from mobile devices, which usually default to emoji presentation. I talk about this a little in my recent "Text layout is a loose hierarchy of segmentation" blog post.


Interesting, every font I've seen renders them like this: https://imgur.com/a/ZPBfaK0

I thought it was suggested to render them like that from Unicode somewhere, but it's 100% possible I'm making that part up.


My recollection is for the textual characters that got repurposed as emoji, it’s implementation-defined as to whether the character defaults to text or emoji but that most renderers opt for emoji (which I believe is copying Apple’s decision here). There are variation selectors you can use to force textual or emoji presentation.


It's not implementation defined, it's defined by the Emoji_Presentation property in https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.t...

Some implementations are just wrong.


That file is new in Unicode 13. Also I think it might be based on Apple's behavior. I just tested every single emoji codepoint listed there, and it's almost entirely consistent with Apple's default rendering. And the very few exceptions (like ) actually seem to depend on whether the text renderer is styled or not, where it becomes emoji in styled fields (like Spotlight) and text in plain-text fields (like this comment box). This seems to only be done for the small handful of pre-existing text symbols that got turned into emoji and given the Emoji_Presentation property.


On my Android Chrome those characters are indeed not rendered specially.


Doesn't do anything special on desktop Firefox, but on a chromium browser A,B,O and AB have a red background while P has a green background


Another interesting character is U+2189 Vulgar Fraction Zero Thirds: ↉

IIRC, it also came from a Japanese code page and was probably used for baseball scores, although the exact origin and usage remains a bit of a mystery.


It is very likely that this symbol was added for baseball. It's not from the Japanese national standard but from a character set for Japanese TV sets called the ARIB charset (1). According to Wikipedia (2), ↉ is actually used for baseball scores. It says that if a pitcher is removed before any batter is put out, he is recorded as having thrown ↉ innings for that inning.

1 https://en.wikipedia.org/wiki/ARIB_STD_B24_character_set

2 https://ja.wikipedia.org/wiki/%E6%8A%95%E7%90%83%E5%9B%9E


> nobody could tell what they meant or how they should be pronounced

Definitely the legend of this area is Korean[1]. Almost 95% (or even more) of characters have never been used and no one know how to pronounce it. Even from the first line of the blocks, I could find weird characters like 갅, 갌, 갍, 갎.

[1] https://en.wikipedia.org/wiki/Hangul_Syllables#Block


> Almost 95% (or even more) of characters have never been used and no one know how to pronounce it.

This is absolutely false. KS X 1001, the primary character set for Hangul, contains 2,350 out of 11,172 (modern) syllables, which is definitely much larger than 5%. And that's not enough (KS X 1001 itself was heavily criticized for this); multiple secondary character sets also made it into Unicode. Unicode 1.1 had 6,656 arbitrarily ordered characters that were finally replaced by 11,172 neatly ordered characters in 2.0. They are real characters people use; my very old personal research [1] indicates that at least half of them have legitimately appeared in chat logs, for example.

Pronunciation-wise, multiple consonant clusters in the final position indicate conditional pronunciation. "갅", for example, consists of ㄱ g + ㅏ ah + ㄴ n + ㅈ j, which is normally pronounced "gahn" on its own but "gahn-j" when followed by a vowel (e.g. "갅이" is pronounced "간지" gahn-ji). Phonetically Korean only has 7 possible codas; the other final consonants exist because Korean orthography is a compromise between phoneticism and ideographicism and some words had to be adjusted to show their morphological roots.
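The neat ordering in 2.0 means every syllable's code point can be taken apart with plain arithmetic. A minimal sketch of that decomposition: the constant names and the `decompose` helper are chosen for this example, and the counts are 19 initials, 21 vowels and 28 finals (including "no final").

```python
import unicodedata

S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28
assert L_COUNT * V_COUNT * T_COUNT == 11172  # the modern syllable count cited above

def decompose(syllable: str) -> tuple[int, int, int]:
    """Split a precomposed syllable into (initial, vowel, final) indices; final 0 means no coda."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, tail = divmod(rest, T_COUNT)
    return lead, vowel, tail

print(decompose("갅"))
# NFD yields the same pieces as conjoining jamo code points.
print([unicodedata.name(j) for j in unicodedata.normalize("NFD", "갅")])
# Share of the modern syllables covered by KS X 1001.
print(f"KS X 1001 coverage: {2350 / 11172:.0%}")
```

The rarely seen syllables near the start of the block are simply the combinations where the final index points at a cluster such as ㄴ+ㅈ.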

[1] https://j.mearie.org/post/24348147729/hangeul-usage-in-irc-c... (in Korean)


I love how weird errors like this can affect culture and the language itself.

There was an album put out last year named 彁, which is one of the more well-known ghost kanji: http://www.rd-sounds.com/c97.html


How is it pronounced given that this is a non existing kanji?

That's a similar problem to Prince's name during the time he had a symbol as his name, for which the workaround was just to call it "symbol".


Read it by the implied phonetic component, 哥。

More broadly speaking, there are some true facts cited in this post, but the basic premise is horseshit, and the whole of it should be retracted.

As Andrew West has pointed out, these characters were not "made up" as they are real characters attested in Chinese texts. From a Unicode perspective, none of them is a "ghost character" at all.


I disagree with your characterisation as "horseshit" -- although eleven of the characters were subsequently located in other texts, these texts are different from the sources that were supposedly cited when they were input into JIS C 6226 in 1978 [1]. Of the twelve characters cited as "ghost characters", nine of them had claimed sources, but on further investigation were not in the claimed sources or were data entry errors of similar characters in those sources, while three did not have any claimed sources, of which two were later found in other texts (and one of those is believed to be an error), and 彁 has never been located in any text predating its input into JIS C 6226.

[1] https://ja.wikipedia.org/wiki/%E5%B9%BD%E9%9C%8A%E6%96%87%E5...


The problem is that we don't know, and probably never will be able to know, whether it was actually used or not. Chinese characters are very flexible and the phonosemantic construction of 彁 is fully justified; we just haven't been able to attest it (before 1978). The term "ghost characters" is merely colloquial and doesn't correctly reflect the reality of standards.


My friends and I called it TAFNAP.


TAFKAP, surely?


But probably pronounced as TAFNAP.

What a language.


Wow, didn't expect to find a reference to RD-Sounds on HN.


> Following the general adoption of the JIS standards these characters all made their way into Unicode

This is the exact kind of place lost glyphs would find a home. The UNIHAN effort (which spans tons of sources) is very dedicated to cataloging these efforts.

I run a few projects based around UNIHAN, which is geared toward cataloging glyphs that are variants, many of which are archaic and no longer used.

Write up about UNIHAN: https://unihan-etl.git-pull.com/unihan.html

I have a tool to export Unicode's database of CJK characters to CSV, JSON, and so on: https://unihan-etl.git-pull.com/

Python library: https://cihai.git-pull.com/

CLI tool (compare to cjklib [1]): https://cihai-cli.git-pull.com/

No knife / character decomposition. I'm working on it but the data is GPL'd: https://github.com/cjkvi/cjkvi-ids/issues/65


Slightly related: on French keyboards there is a key dedicated to the letter ù (u with a grave accent), which is used in only one word in the French language: "où", which means "where".

("ou" without an accent means "or", so SQL would be funny in French: SELECTIONNE * DE matable OÙ type='A' OU type='B' )


I've been wondering why there are subscripts for only these 13 lowercase letters: ₐₑₒₓₔₕₖₗₘₙₚₛₜ

https://en.wikipedia.org/wiki/Superscripts_and_Subscripts_(U...

Anybody know the reason?


It's used for the Uralic Phonetic Alphabet.

https://www.unicode.org/L2/L2009/09028-n3571-upa-additions.p...


Interesting, how did you figure that out?


You can look up the characters in the database, they fall under "Subscripts for UPA".
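As a rough illustration of looking the characters up, here is a small Python sketch using the standard `unicodedata` module. It assumes the letters quoted above are the contiguous range U+2090..U+209C; the helper name `subscript_letters` is made up for this example. Note that `unicodedata` only exposes character names, not the "Subscripts for UPA" subheading from the names list, so this only shows the per-character entries.

```python
import unicodedata


def subscript_letters(first: int = 0x2090, last: int = 0x209C):
    """List (code point, character, name) for a range of code points.

    The default range is an assumption: it is where the subscript
    letters quoted in the thread appear to live.
    """
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
        for cp in range(first, last + 1)
    ]


if __name__ == "__main__":
    for code_point, char, name in subscript_letters():
        print(code_point, char, name)
```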



> For example, 妛 was an error introduced while trying to record "山 over 女". "山 over 女" occurs in the name of a particular place and was thus suitable for inclusion in the JIS standard, but because they couldn't print it as one character yet, 山 and 女 were printed separately, cut out, and pasted onto a sheet of paper, and then copied. When reading the copy, the line where the two little pieces of paper met looked like a stroke and was added to the character by mistake. The original character (𡚴) was not added to JIS or Unicode until much later and doesn't display on most sites for me.

Meanwhile, in a parallel universe where ASCII was invented in Asia and Latin characters were not available until Unicode was established:

> For example, Ä was an error introduced while trying to record A below a dotted line. When reading the copy, the dotted line was added to the character by mistake. The original character (A) was not added to Unicode until much later and doesn't display on most sites for me.


I know, it's not really the same thing, but this reminded me of the RYRYRY test code used with Baudot at the beginning of teletype messages.

https://en.wikipedia.org/wiki/Baudot_code


The only spectres still haunting Unicode are U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER, which are not whitespace but are valid ID_Start and ID_Continue characters. Zero-width whitespace characters are illegal in identifiers. E.g. https://github.com/jagracey/Awesome-Unicode#user-content-var...

The unused glyphs in the post are harmless, because they are unused.
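For anyone who wants to check the filler characters locally, here is a hedged Python sketch. It reports what the interpreter's bundled Unicode data says; the `describe` helper is a name invented for this example. Python's `str.isidentifier()` is based on the XID_* properties rather than ID_Start/ID_Continue directly, and the answer may differ between Unicode versions, so no particular result is claimed for that field.

```python
import unicodedata

# The two filler characters named in the comment above.
FILLERS = ["\u3164", "\uffa0"]


def describe(ch: str) -> dict:
    """Summarise how this interpreter classifies a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),   # "Lo" = Letter, other
        "is_whitespace": ch.isspace(),
        # Depends on the Unicode version bundled with the interpreter.
        "is_python_identifier": ch.isidentifier(),
    }


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for ch in FILLERS:
        print(describe(ch))
```

The point of interest is the combination of a letter category with a blank rendering: a character that looks like whitespace but is not classified as such.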


Many of the characters in the original Hong Kong extension are Cantonese words for male and female sex organs. They were included so that courts could write down exactly what gangsters said to each other.


related:

Did Burmese typewriters contain an upside-down character, which subsequently became proper typewriter style?

https://skeptics.stackexchange.com/questions/49653/did-burme...


A printing mistake was also responsible for the lambda in Lambda Calculus.


Another printing mistake gave English the word "dord"[0] (at least temporarily, and from a prescriptivist point of view). While this "word" didn't influence Unicode, it is somewhat analogous to the situation in Japan where there are symbols that are arguably in the language but which are never meaningfully used.

I assume that the situation is even stranger in Japanese because these mistakenly created symbols do not have either an associated definition (even a mistaken one) or a pronunciation.

[0] https://www.youtube.com/watch?v=nb0YoRMXIY0


Possibly not, according to [0], which links to [1], presented by a student of Church.

0: https://math.stackexchange.com/questions/64468/why-is-lambda...

1: https://www.youtube.com/watch?v=juXwu0Nqc3I


I thought it was more of an accident? The story I know goes that Church used hats over variables, and then he switched over to lambda prefixes for ease of printing.


I have often thought that if I were to teach a class on functional programming, I would use JavaScript’s fat arrow notation to introduce lambda calculus. Less succinct with all the extra parentheses that occur in function calls but I think it’s easy enough to explain.


Pretty much the same idea, but in math contexts I tend to use the \mapsto arrow (↦), since mathematicians are more familiar with this than lambda calculus. (The fat arrow seems like an interesting alternative in programming contexts, but I worry that it might be confused with implication.)

The function that swaps the arguments to a two-argument function: f ↦ ((x, y) ↦ f(y, x))


Interesting. The mathematical notation I learned in school was: f:x↦2x+3. So I guess this would translate to f:(x,y)↦(y,x).

Alternatively, the f(x,y) = (y,x) notation would work (sort of, you tend to explicitly write out the base vector 𝑒ₖ for both notations: f(x,y) = f(x𝑒₁+y𝑒₂) = y𝑒₁+x𝑒₂ )

https://en.wikipedia.org/wiki/Function_(mathematics)#Arrow_n...


That's close but not quite how I meant for the notation to be interpreted: f is an argument to an anonymous function. Maybe clearer is

swap = f ↦ ((x, y) ↦ f(y, x))

The rule this satisfies is

swap(f)(x, y) = f(y, x)

and the type for swap is

swap : (X × Y → Z) → (Y × X → Z)

for some types/sets X, Y, and Z.

(Also, note that tuples might not be anything like a vector space. For example, for String × [Int], you wouldn't usually write ("hi", [2,3]) as "hi"𝑒₁ + [2,3]𝑒₂ since there's not really a good commutative addition operation for strings or lists.)
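For readers who prefer code to arrows, the `swap` function above can be sketched in Python roughly as follows. The type variables X, Y and Z mirror the sets in the comment; everything else (the use of `typing.Callable`, the sample function `describe`) is just an illustrative assumption.

```python
from typing import Callable, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")
Z = TypeVar("Z")


def swap(f: Callable[[X, Y], Z]) -> Callable[[Y, X], Z]:
    """swap = f ↦ ((x, y) ↦ f(y, x)): return f with its two arguments exchanged."""
    return lambda x, y: f(y, x)


def describe(word: str, numbers: list) -> str:
    """Hypothetical two-argument function on String × [Int], used only for the demo."""
    return f"{word}: {numbers}"


if __name__ == "__main__":
    # swap(f)(x, y) == f(y, x)
    print(swap(describe)([2, 3], "hi"))
    print(swap(lambda a, b: a - b)(1, 10))
```

Note that `f` is an argument to `swap`, and the result is itself a function, which is the point the arrow notation was trying to make.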


And here it is:

(Now we see how many people get the joke.)


Wow. Now even I don't get the joke.

When I posted, the post included this character:

U+262D

https://www.fileformat.info/info/unicode/char/262d/index.htm

Unicode Character 'HAMMER AND SICKLE' (U+262D)

Yep, it gets eaten by HN. It's in the BMP, dammit! It should be safe!
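The "it's in the BMP" claim is easy to check with Python's standard `unicodedata` module. This sketch only inspects the character itself; it says nothing about what rule HN actually applies, which is not known here. The helper name `plane` is invented for the example.

```python
import unicodedata


def plane(ch: str) -> int:
    """Unicode plane of a character: 0 is the Basic Multilingual Plane."""
    return ord(ch) >> 16


if __name__ == "__main__":
    ch = "\u262d"
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    print("plane:", plane(ch))
    print("general category:", unicodedata.category(ch))  # "So" = Symbol, other
    # The Miscellaneous Symbols block spans U+2600..U+26FF.
    print("in Miscellaneous Symbols:", 0x2600 <= ord(ch) <= 0x26FF)
```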


HN intentionally removes “emoji-type” characters (I don’t know what the exact criteria are, it’s definitely more complicated than which plane the character is on) from posts. The only time I’ve seen this cause issues is when trying to discuss Unicode itself; presumably the moderators decided this is an acceptable tradeoff.


Automatically silently stripping characters from posts seems like a bad idea. If we want to forbid characters from posts, wouldn't it be way safer to prevent the post and tell the user to remove them?


There are no obvious criteria. 𓂺


Could you not have picked a different emoji? (Also, why is that a character?)


> (Also, why is that a character?)

It's in Unicode because it's a hieroglyph. As for why it's a hieroglyph, I think you would need to ask the ancient Egyptians.


Possibly the entire "Miscellaneous Symbols" block?


It's invisible, it's a ghost character!


I see it fine. Enable showdead in your account preferences.


Then who was phone?


"A spectre is haunting Europe — the spectre of communism." The Communist Manifesto


Is there a symbol for fascism?


There are swastikas facing both directions as religious symbols, but not angled Hakenkreuz like the Nazis tended to use. There also isn't a fasces or anything like that, although I feel like that's a less widely recognized symbol (and of course it is used in non-fascist contexts as well).

Probably the most interesting political symbol in Unicode is the Emblem of Iran: https://en.wikipedia.org/wiki/Emblem_of_Iran

It's referred to officially as "FARSI SYMBOL", which apparently was a euphemism chosen because during ISO standardization of Unicode the original name "SYMBOL OF IRAN" was deemed unacceptable. As a logo, it wouldn't make it into the standard today, but nobody knows how it originally got in there:

http://archives.miloush.net/michkap/archive/2005/01/29/36320...


The Celtic cross is also there, though typically represented as asymmetrical, while the one used in fascist contexts is symmetrical.


W: "The Italian term fascismo is derived from fascio meaning "a bundle of sticks", ultimately from the Latin word fasces.

The symbolism of the fasces suggested strength through unity: a single rod is easily broken, while the bundle is difficult to break."

So just look for a farming-related item with 2 or more stick-like objects bound together.

https://en.wikipedia.org/wiki/Fascism




