Here's a basic explanation of the diacritic notion as it applies to Asian scripts.
Thai belongs to a family of scripts known as abugidas. Abugidas include pretty much all South Asian and many Southeast Asian scripts, for example Burmese, Cambodian, Dai, Lao, Thai, etc. They all pretty much derive from Brahmi, which was the proto Indian script. You can see an example of Brahmi over here: http://en.wikipedia.org/wiki/Brahmi
Abugidas are based upon combining multiple glyphs into syllables, often allowing glyphs above, below, to the left and to the right of the initial consonant, and often including a closing consonant. Most glyphs tend to be consonants, though some are vowels, and others can be special marks indicating tone or other notions. Often short vowels are omitted entirely (much as in Modern Standard Arabic).
In old times, such scripts were handled with wacky font-hacks. However, with Unicode, there are some super complex algorithms that make glyphs combine both visually (when typesetting) and logically (when saving/searching/etc). You can actually type a character and a diacritic and it can sometimes automatically combine to form a single character, if such a beast exists, not just visually but when saving to disk.
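That "combine when saving to disk" behavior is Unicode normalization. Here's a minimal sketch in Python using the stdlib's `unicodedata`: a base letter plus a combining diacritic (two codepoints) can be folded into a single precomposed codepoint with NFC, when such a precomposed character exists.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two separate codepoints
decomposed = "e\u0301"

# NFC ("canonical composition") folds them into one precomposed character,
# U+00E9 LATIN SMALL LETTER E WITH ACUTE, when such a beast exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 2
print(len(composed))    # 1
print(composed)         # 'é'
```

Both forms display identically; normalization is what makes them compare equal when saving and searching.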
What makes it even more confusing is that South Asian scripts in particular have mega-combo characters, where whole chunks of glyphs sort of fold into flowing shorthand symbols. In the case of Sanskrit, I believe loads of these were used historically but few are used these days.
I think that's a fair pontification - corrections welcome!
I wouldn't call them "mega-combo characters" ;-) Typically, these are at most two or three letters written together, and they are very much used today in Sanskrit, Hindi and other regional variants such as Bengali etc.
Of course you also have multiple words that have been combined into one large compound word, by way of appropriate linguistic rules of combining sounds. This is similar to long compound words in German.
This is the term I'm familiar with. They are similar to ligatures in the Latin script, e.g. & (e+t), œ (o+e), German ß (ſ (long s) + s), etc. But conjuncts in Devanagari are much more numerous and widely used, probably because the individual consonant forms are fairly complex.
I have learned the Thai alphabet, and that "mega-combo character" comment just made my day. Reading Thai pretty much feels like reading regexps. Anditdoesntmakeitanyeasierthattheydontusespaces
Actually Thai only has roughly twice the number of characters that we have in Roman scripts - excepting tones (a real pain) it is possible to learn pretty quickly. Lao by contrast has fewer, but has a rather tricky plethora of vowel combinations for a myriad of hard-to-distinguish eww, ieww, ooh, iuooh type sounds. :) Cambodian has no tones and is my pick for the one to go for if you are keen on an easy starter.
Tangential tidbit: I sent a copy of The Cambodian System of Writing (http://pratyeka.org/csw/) to TPB's anakata while he was in solitary confinement, to help him stave off boredom. No idea if he ever read it, though his mother assures me it arrived.
Besides ก (0xe01), the rest of the Thai characters (0xe02-0xe5b) are: letters ขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮฯะาๅ; alphabetic non-spacing marks ัิีึืฺุูํ, which follow the preceding character and render as stacked (as shown by the OP); five "marks" เแโใไ (with Unicode's Logical Order Exception), which precede the following character; diacritic non-spacing marks ็่้๊๋์๎, which also render as stacked; the modifier letter ๆ; the digits 0-9 ๐๑๒๓๔๕๖๗๘๙; the currency symbol baht ฿; and the punctuation symbols ๏๚๛.
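You can verify those classifications straight from the Unicode database; a quick sketch in Python printing the name, general category, and combining class for a sample of each group (letter, non-spacing mark, preposed vowel, digit, currency symbol):

```python
import unicodedata

# One representative from each group in the Thai block (U+0E00-U+0E7F)
samples = ["\u0e01", "\u0e47", "\u0e40", "\u0e50", "\u0e3f"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):35s} "
          f"category={unicodedata.category(ch)} "
          f"combining={unicodedata.combining(ch)}")
```

The "non-spacing" groups show up with general category Mn, which is exactly what a renderer keys on to stack them over the preceding base glyph.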
Thai looks to be a pretty gr๏๏vy language, like many other natural languages - perhaps some programming languages will catch up in their enhanced use of lexical tokens one day, instead of just relying on grammar, long English names packed into name hierarchies, and multi-ASCII symbols.
I think I would steer new learners of SE Asian scripts away from Cambodian as a first language to learn. Cambodian, while it is a beautiful script and has more-or-less regular pronunciation, also has a few odd exceptions, and complex vowel pronunciation rules. The consonants are divided into two groups, and many of the vowels are pronounced differently in the first group than in the second. That said, after learning the Thai script, Lao and Cambodian were not too hard. Either way, they are all great languages to learn.
Actually I am almost certain Thai and Lao have those consonant divisions as well, in fact I believe 3 or 4. If I am not mistaken they are still taught, and are part of the tone system and/or can affect unwritten vowel selection. More certainly, the consonant classes somehow stem from the need to preserve the pronunciation of Pali, a Middle Indian prakrit language (with features not present in these SEA countries' modern languages) that is used as the liturgical language of Theravadin ("older school") Buddhism. See http://pali.pratyeka.org/ for more info on that.
The worst thing about "not using spaces", when it comes to computers, is that it's nearly impossible to do word-breaking/line-breaking without relying on a dictionary[1], which makes it very hard to convince software developers who aren't using the system text engine to add support for line breaking[2].
The suffering continues to this date: Android, for instance, still lacks proper support (it breaks by character instead of by word), and a few apps lack Thai word-breaking support entirely (Twitter, which only breaks on spaces).
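To give a feel for why a dictionary is needed: with no spaces, the only way to find word boundaries is to match substrings against a lexicon. Here's a toy maximal-matching breaker in Python; the tiny lexicon is a made-up example, and real engines (e.g. ICU's Thai BreakIterator) use large dictionaries plus statistical fallbacks for unknown words.

```python
# Hypothetical tiny lexicon, just for illustration
DICT = {"สวัสดี", "ครับ", "ไทย", "ภาษา"}

def word_break(text, lexicon):
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary word that matches; emit unknown chars singly."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # no match: emit one character alone
            i += 1
    return words

print(word_break("สวัสดีครับ", DICT))  # ['สวัสดี', 'ครับ']
```

Greedy matching already fails on genuinely ambiguous sequences, which is why production line-breakers go well beyond this sketch.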
It's actually relevant here. For people less familiar with some aspects of Unicode, this is a neat example, along the same lines as the OP. Just don't try to select it!
It's displayed as boxes here, not the actual zalgo text (although if I paste it somewhere else it looks correct there), even though my encoding is set to UTF, and I can't select individual boxes. The whole page kept blinking, but that was a separate issue: with the layout of the comments, when I moved over certain borders it kept selecting and deselecting the whole page, which caused it to blink.
Chrome Windows 7 ends up rendering them all as boxes. The lack of whitespace prevents line breaking. The problem is that it forces the container out to 1400px wide, rather than responding to browser width.
I really don't like trying to read 1400px long lines of text.
(to me it looks like some kind of particle accelerator experiment happening in your browser, where ก็็็็็ emits some kind of unknown radiation. After closer examination [zoom to +300%]: maybe it just shows the escaping life spirits of the toppled latin small letter «u» after being shot in right side.)
It's actually more correct on your browser. It's a stack of diacritics. The rasterizer above was printing them on top of each other, which is (well, seems likely to be) typographically incorrect. The problem of course is that the implemented rules should disallow unused-in-the-real-world combinations of diacritics but don't.
I have tested with Firefox (latest version) on Windows XP and 7, and I think we can conclude the problem comes from Windows XP, as it shows the same thing as you, whereas it gives a whole big stack on Windows 7.
I actually get a different rendering for some reason. I get all the extra diacritics to the right, above empty circles. That certainly explains why I didn't quite understand the problem.
That's not really a Thai character, right? It's way too many bytes! It must be an intentional repetition of stacking diacritics. Some of the ones in that Google result page are 21 bytes.
Something I wondered once: If one were to have a go at sanitizing Unicode input (e.g., for a forum), what would be a sensible limit on the number of diacritics to allow, without interfering with languages that need them?
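One plausible approach (a sketch, not a vetted sanitizer): normalize first so precomposed forms don't count against the limit, then cap the run of combining marks per base character. The limit of 4 below is an arbitrary guess; real-world stacks (Thai vowel + tone, Vietnamese after decomposition) rarely need more than 2-3 marks on one base.

```python
import unicodedata

def cap_marks(text, max_marks=4):
    """Drop combining marks beyond max_marks per base character.
    max_marks=4 is a guess at a safe limit, not a researched value."""
    out, run = [], 0
    for ch in unicodedata.normalize("NFC", text):  # fold precomposed forms first
        if unicodedata.category(ch).startswith("M"):  # Mn/Mc/Me: combining marks
            run += 1
            if run > max_marks:
                continue  # excess mark: silently drop
        else:
            run = 0
        out.append(ch)
    return "".join(out)

print(cap_marks("\u0e01" + "\u0e47" * 20))  # base + at most 4 marks survive
```

Note that legitimate text like 'Spin̈al Tap' passes through untouched, since one mark per base is well under any sane cap.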
Indeed, pasting from the Google input and investigating, it's a single U+0E01 ก (THAI CHARACTER KO KAI) with 20 U+0E47 ็ (THAI CHARACTER MAITAIKHU), a diacritic.
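Easy to reconstruct and count the bytes yourself; a quick check in Python (both codepoints sit in the three-byte UTF-8 range, so 21 codepoints come out to 63 bytes):

```python
import unicodedata

zalgo = "\u0e01" + "\u0e47" * 20   # KO KAI + 20 x MAITAIKHU

print(len(zalgo))                   # 21 codepoints
print(len(zalgo.encode("utf-8")))   # 63 bytes (3 bytes per codepoint in UTF-8)
for ch in zalgo[:2]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```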
This character used to be (or maybe still is) a very popular way of trolling people on facebook. Flooding chat window with those funny letters seemed to crash the browser after a while.
Once had this dude sign up for our page management... we had a lot of assumptions about plain or at least sane text that had to get updated. https://www.facebook.com/glitchr
The problem seems to be restricted to Windows - apparently it looks normal on Ubuntu and Mac. (Here: Mac 10.6, works fine on Chrome, Firefox and Safari.)
So I guess it's a platform issue.
In cases like this it is useful to post a short description of the link. Like this: http://stackoverflow.com/a/1732454 (the famous "don't parse HTML with regex" cry for sanity, semi-relevant because of zalgo text.) FTFY
What on earth is going on here? 🔴҈҈҈҈̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚
That's a pretty fun one. It's a sequence of the following codepoints: {LARGE RED CIRCLE} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} and then 66 repetitions of {COMBINING LEFT ANGLE ABOVE} {COMBINING CYRILLIC MILLIONS SIGN}
Funny you mention this. We are working on support for this right now (today). EAWebKit, used by SimCity, didn't originally have Thai support, and it's being implemented now. I can tell you what it will look like though, as I just took a screenshot: http://i.imgur.com/ZFtGP87.png. It's conventional that repeated Thai decorators stick with the base glyph, though that character is invalid Thai. We might fix it nevertheless.
Hmm, that's interesting. At work I looked at this on my Ubuntu machine with the Chromium browser. It didn't look particularly special because all the diacritic marks were drawn on top of each other in the same location above the letter.
I come home and look at it on my Windows machine with Chrome, and now I see the big stack of diacritic marks that I assume everyone's making a fuss about. I assume it's something to do with the way that the system's installed font lays out the marks in question.
Yeah I was completely baffled by this until I looked in Google Image Search. I ran this on my Windows box and sure enough it looks crazy with little 'springs' going everywhere. Under Chrome, Firefox and Safari on OS X it looks normal though, just foreign text.
I was going to ask about this a week ago. An "Anonymous" twitter account posted it last week and the letters overlaid 3 or 4 tweets above it. Had no idea what it was.
Technically "correct," you mean; if there were a Unicode character that filled the whole screen with black, it wouldn't matter that it was technically "correct". Usability correctness is more important.
If stacking diacritics are a legitimate part of a language and they change the meaning of words or characters, then it's more important that the content be correct than the "usability."
In fact, if a person can't read it or reads it incorrectly because parts of characters are hidden, then it's not very usable.
Hypothesizing about a Unicode character that fills the screen with black is a nonsensical straw man, because it makes no sense in "real world" written languages, so there would never be a Unicode character for it.
False; diacritics commonly used in the real world, such as ´¨`, fit inside the same space as the characters, while this successive chain of diacritics is never used in real-world text except in a very few obscure cases. Besides, if the implementers really believe that displaying obscure characters is more important than usability, the text-rendering implementation should reserve the line-height required to display the character correctly.
After trying the CSS suggested here on the Google page inside Chrome, the problematic characters no longer run too far up like they did originally, but they get clipped at the top of the first line of the paragraph, rather than at the line the character is actually on.
Doh, you're right. For some reason I had it in my head that the block itself wrapped onto multiple lines but I must have been thinking of something else.
It seems like there isn't really any CSS only solution for this without wrapping every character in its own element like jQueryIsAwesome suggested.
It's a huge hammer and a considerably larger webpage footprint just so that a type of character isn't allowed to run amok, something that's not going to happen anyway in the vast majority of cases.
We have very different definitions of "huge hammer", in any decent modern browser (Chrome, Firefox, IE9) it takes less than 10ms to apply the mentioned code in 20 texts (using the linked 10 Google results as an example).
It also puts every single special character into its own span. That may negatively affect programs that trust HTML for copy-paste, for example Open Office, Microsoft Word (I think?), some IM clients, some text editors, etc. Doesn't seem worth it in the general case.
That said, if those sorts of things bother an individual, they could run that on the page themselves, I suppose, so it's good for that :)
10ms is a huge amount to spend for such an obscure fix. How many obscure corner cases do you think Google webpages have to account for? A lot more than 100, I would bet. So rough methods like these would be horrible for performance.
You're the one who made the 10ms claim, not me. If indeed it is faster than you claimed, then maybe it would be ok. Still probably not worth engineering time for a problem of this magnitude.
It corrupts Unicode data because it splits on code points instead of grapheme clusters. When performed on 'Spin̈al Tap', it splits the base character U+006E (LATIN SMALL LETTER N) from the combining character U+0308 (COMBINING DIAERESIS) and results in the string 'Spin<span class="s_char">̈</span>al Tap', which contains the valid Unicode grapheme cluster '>̈'! If you were to split on grapheme clusters instead, the result would be 'Spi<span class="s_char">n̈</span>al Tap'. However, I still wouldn't support that solution because it could negatively affect text segmentation used by search engine indexing and natural language processing tools.
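The fix for that corruption is to split on something cluster-shaped rather than on codepoints. Python's stdlib has no UAX #29 segmenter, so this is a deliberately simplified approximation (it only attaches Mn/Mc/Me combining marks to the preceding base; full grapheme clustering also handles ZWJ sequences, Hangul jamo, etc.):

```python
import unicodedata

def clusters(text):
    """Simplified grapheme clustering: glue each combining mark
    (general category Mn/Mc/Me) onto the preceding base character."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

name = "Spin\u0308al Tap"  # 'Spin̈al Tap': the n carries a combining diaeresis
print(clusters(name))      # the n and its diaeresis stay in one cluster

# Wrapping per cluster keeps the diaeresis off the markup's '>'
spans = "".join(
    f'<span class="s_char">{c}</span>' if len(c) > 1 else c
    for c in clusters(name)
)
print(spans)  # 'Spi<span class="s_char">n̈</span>al Tap'
```

Even done correctly, the segmentation-index and NLP objections above still stand.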
See that: yet another issue with characters in file/directory names that aren't printable ASCII.
This kind of stuff is precisely the reason we make sure that every filename we create uses only a subset of ASCII (and no spaces, of course). In our source code, in our builds, in the desktop app we're serving, etc.
Unicode characters entered by users should go in one place: the DB.
I smiled the other day when I read the build script for Chromium: it clearly specified that the source directory must not contain any spaces in its name.