Hacker News
Thai character ก็็็็็ (ก) gets rendered in a strange way (google.co.uk)
228 points by sprogcoder on March 12, 2013 | 147 comments



Here's a basic explanation of the diacritic notion as it applies to Asian scripts.

Thai belongs to a family of scripts known as abugidas. Abugidas include pretty much all South Asian and many Southeast Asian scripts, for example Burmese, Cambodian, Dai, Lao, Thai, etc. They all pretty much derive from Brahmi, which was the proto Indian script. You can see an example of Brahmi over here: http://en.wikipedia.org/wiki/Brahmi

Abugidas are based upon combining multiple glyphs into syllables, often allowing glyphs above, below, to the left and to the right of the initial consonant, and often including a closing consonant. Most glyphs tend to be consonants, though some are vowels, and others can be special marks for indicating tone or other notions. Often shorter vowels are excluded (as in Modern Standard Arabic).

In the old days, such scripts were handled with wacky font hacks. With Unicode, however, there are some super complex algorithms that make glyphs combine both visually (when typesetting) and logically (when saving/searching/etc). You can actually type a character and a diacritic and they can sometimes automatically combine to form a single character, if such a beast exists, not just visually but when saving to disk.
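
To make the "combine when saving" point concrete, here is a minimal Python sketch. It assumes the logical combining meant here is Unicode normalization (NFC), which the comment does not name, and it uses only the standard `unicodedata` module.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed code point U+00E9, because one exists.
composed = unicodedata.normalize("NFC", decomposed)

# Thai KO KAI + MAITAIKHU has no precomposed form, so NFC leaves both code points alone.
thai = "\u0e01\u0e47"
thai_nfc = unicodedata.normalize("NFC", thai)

print(len(decomposed), len(composed))  # 2 1
print(len(thai), len(thai_nfc))        # 2 2
```

The Thai pair stays as two code points, which is why a renderer has to do the stacking itself.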

What makes it even more confusing is that South Asian scripts in particular have mega-combo characters, where whole chunks of glyphs sort of fold into flowing short-hand symbols. In the case of Sanskrit, I believe loads of these were used in history but few are used these days.

I think that's a fair pontification - corrections welcome!


I wouldn't call them "mega-combo characters" ;-) Typically, these are at most two or three letters written together, and they are very much used today in Sanskrit, Hindi and other regional variants such as Bengali etc.

Of course you also have multiple words that have been combined into one large compound word, by way of appropriate linguistic rules of combining sounds. This is similar to long compound words in German.


Apparently sometimes called conjuncts. http://pratyeka.org/sanskrit/conjuncts.html


This is the term I'm familiar with. They are similar to ligatures which exist in the Italic scripts e.g & (e+t), œ (o+e), German ß (ſ(long s)+s) etc. But conjuncts in Devanagari are much more numerous and widely used, probably because the individual consonant forms are fairly complex.


I have learned the Thai alphabet and that "mega-combo character" comment just made my day. Reading Thai pretty much feels like reading regexps. Anditdoesntmakeitanyeasierthattheydontusespaces


Actually Thai only has roughly twice the number of characters that we have in Roman scripts - excepting tones (a real pain) it is possible to learn pretty quickly. Lao by contrast has fewer, but has a rather tricky plethora of vowel combinations for a myriad of hard to distinguish eww, ieww, ooh, iuooh type sounds. :) Cambodian has no tones and is my pick for the one to go for if you are keen on an easy starter.

Tangential tidbit: I sent a copy of The Cambodian System of Writing (http://pratyeka.org/csw/) to TPB's anakata while he was in solitary confinement to help him stave off boredom. No idea if he ever read it, though his mother assures me it arrived.


Besides ก (0xe01), the rest of the Thai characters (in 0xe02-0xe5b) are: letters ขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮฯะาๅ; alphabetic non-spacing marks ัิีึืฺุูํ, which follow the preceding character and render as stacked (as shown by the OP); five "marks" เแโใไ (with Unicode's Logical Order Exception), which precede the following character; diacritic non-spacing marks ็่้๊๋์๎, which also render as stacked; modifier letter ๆ; digits 0-9 ๐๑๒๓๔๕๖๗๘๙; currency symbol Baht ฿; and punctuation symbols ๏๚๛
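
If you want to reproduce that inventory yourself, here is a rough Python sketch using the standard `unicodedata` module. It walks U+0E01 through U+0E5B (the range is my assumption about where the assigned Thai code points end, since ๚ and ๛ sit at U+0E5A and U+0E5B) and groups them by Unicode general category.

```python
import unicodedata
from collections import defaultdict

# Group the Thai code points U+0E01..U+0E5B by Unicode general category.
groups = defaultdict(list)
for cp in range(0x0E01, 0x0E5C):
    ch = chr(cp)
    category = unicodedata.category(ch)
    if category != "Cn":  # skip unassigned code points inside the range
        groups[category].append(ch)

for category in sorted(groups):
    members = groups[category]
    # Print code points rather than glyphs so the combining marks don't stack in the terminal.
    print(category, len(members), " ".join(f"U+{ord(c):04X}" for c in members))
```

The `Mn` group holds the non-spacing marks, which are the ones that stack when repeated.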

Thai looks to be a pretty gr๏๏vy language, like many other natural languages - perhaps some programming languages will catch up in their enhanced use of lexical tokens one day, instead of just relying on grammar, long English names packed into name hierarchies, and multi-ASCII symbols.


I think I would steer new learners of SE Asian scripts away from Cambodian as a first language to learn. Cambodian, while it is a beautiful script and has more-or-less regular pronunciation, also has a few odd exceptions, and complex vowel pronunciation rules. The consonants are divided into two groups, and many of the vowels are pronounced differently in the first group than in the second. That said, after learning the Thai script, Lao and Cambodian were not too hard. Either way, they are all great languages to learn.


Actually I am almost certain Thai and Lao have those consonant divisions as well, in fact I believe 3 or 4. If I am not mistaken they are still taught and are part of the tone system and/or can affect unwritten vowel selection. More certainly, the consonant classes somehow stem from the need to preserve pronunciation of Pali, a middle-Indian prakrit language (with features not present in these SEA countries' modern languages) that is used as the liturgical language of Theravadin ("older school") Buddhism. See http://pali.pratyeka.org/ for more info on that.


(Native Thai speaker here.)

The worst thing about "not using spaces" when it comes to computers is that it's nearly impossible to do word-breaking/line-breaking without relying on a dictionary[1], and it's very hard to convince software developers who don't use the system text engine to add support for line breaking[2].

The suffering continues to this day. Android, for instance, still lacks proper support (it breaks by character instead of by word), and a few apps lack Thai word-breaking support altogether (Twitter, for example, only breaks words at spaces).

[1]: http://linux.thai.net/svn/software/libthai/trunk/data/

[2]: https://bugzilla.mozilla.org/show_bug.cgi?id=7969 it took Mozilla 9 years. Opera, on the other hand, still lacks proper Thai support.
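
For anyone wondering what "relying on a dictionary" looks like, here is a toy Python sketch of greedy longest-match segmentation. The two-word `TOY_DICTIONARY` and the `segment` helper are made up for illustration; this is not how libthai or any browser actually does it, and real segmenters handle ambiguity far better.

```python
# Hypothetical two-word dictionary: "language" and "Thai".
TOY_DICTIONARY = {"ภาษา", "ไทย"}

def segment(text, dictionary):
    """Split text into words by always taking the longest dictionary match."""
    words = []
    max_len = max(map(len, dictionary))
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit one code point and move on.
            words.append(text[i])
            i += 1
    return words

print(segment("ภาษาไทย", TOY_DICTIONARY))
```

Each returned boundary is a place where a line break would be acceptable. Note that the fallback emits single code points, so unknown text can still be split between a base letter and its marks.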


Stuff like that is what was used to implement zalgo text

Ę̮̱͔͓ͯ͗ͫ̌̏ͫ͌́x̘̤͚̰̫̫̗̤̱̒̓ͨͯ͑̓ͥͫ̕å̰͚̓͒ͫm̛̤͕̫̳̺̩̄̓ͨͥ͜ͅp̰͉͗ͤl̵̖̗̫͍͓͋̍̐͌̐̒e̡̧͔̮̿͒͋̈́͡ ̸͉͔͗͐̍ͩͫ̀ͭz̨͎̱̟̘̓ä́͊̉̾͜͏̺̲̘l̛̥͇͖̹̻̜̈̀̀g̴̗̻͚͙̭͍̩̔̉̆ͦ͌͘oͬ̾͑̉̋҉̢͙̹̹̺̺ ̷̢͖̲͇̺̪̹̙̺̘͐̄ͬ̍͆t̶͔̣̜̟͌̀ͪ̅ͧ̒̒ͫ̚ȅ̠̪̻̄ͫ̋͝xͭ͆͝͏̮͔̜t̟̬̦̣̟͉͈̞̝ͣͫ͞,̡̼̭̘̙̜ͧ̆̀̔ͮ́ͯͯ ̢̮͎̦͙͇ͪͪ̈͌ͬ̄̓̐͞ḷ̹̺̙̜̇̉́͡o̢̻̪̠̬̍͐̉ͮͥ̑͊ͪt̢̘̬͓͕̬́ͪ̽́s̢̜̠̬̘͖̠͕ͫ͗̾͋͒̃͛̚͞ͅ ̝̣̥̳͇͎̭̾̔̀̀̔̽̕o͇ͮ̋̅͋͆̈́̔͗͟f̙̙͕̮̈ͪͯ̿̈͠ ̯͎̺͎̺̃̀͟͟d͍͍̺͂̂i̪̩̙̭̝͖ͥ͂̂̈̒̎r̥̜̃̏̃͋̓ͥ̃̉̄͘͢t̳̦̬͆͂ͬͧ̏ͬ̓y̵̮̗̟ͩ̃̾͐́ͩ ̣͍̘͈̫͓̊ͤ̚͡͝cͥͭ͐̎͆͘̕҉̫̞h̴̢̫̘͉̖ͪͩ̓ͪͯ̑͑̓̎͝a̧̢̖͔̗̬̘̯̟ͪ̐͌̍͂̊r̷̝͓̬͆̄̽̓̋ͬ̈̔͝͠ā̗͑ͬ̀c͒̎͌̔͛͘҉̘͖͖̖̯̖͖͙ṱ̶͇͚͎ͯ͋͢͝eͦ̽͆͏̟̭̠r̙̖͙̳̾ͯ̈̕ṣ͙̈͆̔͗̉ͥ̋̔̕


I've been told off for pasting that on HN before. I'd suggest you edit your comment.


It's actually relevant here. For people less familiar with some aspects of Unicode, this is a neat example, along the same lines as the OP. Just don't try to select it!


When you say "don't try", some people take it as a personal challenge :) Why does this happen? How can it mess up my selection like this?


Mess up in what sense? Selection seems to work reasonably on that text for me in Firefox....


It's displayed with boxes here, not the actual zalgo text (although I can paste it somewhere and looks correctly there), although my encoding is set to UTF and I can't select single boxes. The whole page kept blinking, but this was another issue (the layout of the comments, when I went over some borders it kept selecting and deselecting the whole page which caused it to blink).


Huh. Is that in Firefox? On what OS?


Chrome on Windows 7 with U.S. English regional settings... As boring a setup as it gets!


Sounds like a possible WebKit bug, then, to be honest...


Same here. Displays and able to select it quite alright. (Ubuntu 12.10, Firefox 21)

Although searching for the same text results in Google telling me this:

  414. That's an error.

  The requested URL /... is too large to process. That’s all we know.


"Just don't try to select it!"

Why - seems to select perfectly well when I try it using Firefox on Windows 7.



Thank you for ruining my page rendering.


What does it do? Chrome on Linux just has it a bit garbage-y, nothing major: http://i.imgur.com/BopYxYM.png


Chrome Windows 7 ends up rendering them all as boxes. The lack of whitespace prevents line breaking. The problem is that it forces the container out to 1400px wide, rather than responding to browser width.

I really don't like trying to read 1400px long lines of text.


Chrome on Windows 7 renders all the text fine for me

http://i.imgur.com/h7S9SU3.png


Chrome on Windows 7 here, renders most of them as boxes.


I see most of the Unicode characters on Firefox/Gentoo Linux with the following packages installed:

    media-fonts/arphicfonts
    media-fonts/baekmuk-fonts
    media-fonts/cardo
    media-fonts/corefonts
    media-fonts/dejavu
    media-fonts/droid
    media-fonts/font-bh-lucidatypewriter-100dpi
    media-fonts/font-bh-lucidatypewriter-75dpi
    media-fonts/font-bh-ttf
    media-fonts/font-bh-type1
    media-fonts/freefont
    media-fonts/freefonts
    media-fonts/inconsolata
    media-fonts/intlfonts
    media-fonts/kochi-substitute
    media-fonts/symbola
    media-fonts/terminus-font
    media-fonts/ttf-bitstream-vera


Interestingly in my Chrome (27 dev) it looks like this in Linux: http://i.imagebanana.com/img/voz9pem7/Auswahl_048.png


I prefer firefox (mac): http://i.imgur.com/Wcjz21T.png


Yep, FF on mac here and it looks great. Why do people keep using Chrome, IE, Safari, and Opera?


I don't get what's supposed to happen. To me it looks like this: http://i.imgur.com/CMwdLNg.png


This is what it looks like for me: http://i.imgur.com/ppMYz6c.png


Same for me


+1 same for me,

(to me it looks like some kind of particle accelerator experiment happening in your browser, where ก็็็็็ emits some kind of unknown radiation. After closer examination [zoom to +300%]: maybe it just shows the escaping life spirits of the toppled latin small letter «u» after being shot in right side.)


Chrome on Windows 7 64-bit:

http://imgur.com/3u1M2IG


It seems to render right on your browser. This is what I see: http://i.imgur.com/gJ1xBWn.png


It's actually more correct on your browser. It's a stack of diacritics. The rasterizer above was printing them on top of each other, which is (well, seems likely to be) typographically incorrect. The problem of course is that the implemented rules should disallow unused-in-the-real-world combinations of diacritics but don't.

But it's not a "rendering" bug.


About the same for me : http://i.imgur.com/fOTopJU.png

Both Opera and Chrome on Windows XP. On Windows 7 I get the interesting-looking results with both Opera and Chrome.


I have tested with Firefox (latest version) on Windows XP and 7, and I think we can conclude the problem comes from Windows XP, as it shows the same thing as you, whereas it gives a whole big stack on Windows 7.


What browser and platform are you using?


Works fine for me. Safari 6, OS X 10.8.


Works in Chrome 25, 10.6.8


I'm pretty sure it's a Windows GDI font rendering problem. Windows DirectWrite, Mac's CoreText and other platforms seem to do okay.


It seems to work on Firefox x64 19.0 from Gentoo eBuild.


With Opera 12.14, I saw the same thing. I had to open the page in Chrome to see the weirdness.


chrome 25 and xubuntu 12.10


It seems to work on most Linux distros. I'm using Chrome 25 and Windows 7 and it's buggy.


http://i.imgur.com/5ov1qdf.png

This is an interesting result. Even more interesting is the extra 2000 results that Google throws in my direction.


I actually get a different rendering for some reason. I get all the extra diacritics to the right above empty circles. That certainly explains why I didn't quite understand the problem.


Looks like it's only a problem on Windows (and maybe Linux?). OS X is fine, Chrome and Safari.


Same here, looks fine for me in Chrome and Firefox on Ubuntu


That's not really a Thai character, right? It's way too many bytes! It must be an intentional repetition of stacking diacritics. Some of the ones in that Google result page are 21 bytes.


   $ echo ก็็็็็็็็็็็็็็็็็็็ | hexdump
   0000000 e0 b8 81 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0
   0000010 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9
   0000020 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87
   0000030 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 e0 b9 87 0a
   0000040
   $ echo ก็็็็็็็็็็็็็็็็็็็ | wc
       1       1      64
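
The same count can be checked without a shell. This is a small Python sketch that rebuilds the string from its code points (U+0E01 plus 20 times U+0E47) instead of pasting it:

```python
text = "\u0e01" + "\u0e47" * 20  # KO KAI followed by 20 MAITAIKHU marks
encoded = text.encode("utf-8")

print(len(text))         # 21 code points
print(len(encoded))      # 63 bytes, 3 per code point
print(len(encoded) + 1)  # 64 once echo's trailing newline (0a) is counted
print(" ".join(f"{b:02x}" for b in encoded[:6]))  # e0 b8 81 e0 b9 87
```

So it is 21 code points but 63 bytes in UTF-8, and `wc` reports 64 because of the newline.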


Something I wondered once: If one were to have a go at sanitizing Unicode input (e.g., for a forum), what would be a sensible limit on the number of diacritics to allow, without interfering with languages that need them?
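
One possible shape for such a filter, as a sketch only: the helper name `cap_marks` and the default limit of 4 are my own arbitrary assumptions, not a number any standard recommends, and some languages may legitimately need more.

```python
import unicodedata

def cap_marks(text, limit=4):
    """Keep at most `limit` consecutive combining marks; drop the rest of each run."""
    out = []
    run = 0
    for ch in text:
        # Test the general category: Thai MAITAIKHU has combining class 0,
        # so unicodedata.combining() alone would not catch it.
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > limit:
                continue
        else:
            run = 0
        out.append(ch)
    return "".join(out)

stacked = "\u0e01" + "\u0e47" * 20
print(len(cap_marks(stacked)))  # 5: the base letter plus 4 marks
```

Ordinary accented text passes through untouched because it never reaches the limit.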



Thanks. I'd asked on SO about Unicode sanitization before and got a very "brush off" answer. Seems I was asking the wrong question.


Yep. With the diacritics spaced out:

     ก ็ ็ ็ ็ ็


looks like this is the unicode in html

ก ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็ ็


Yes, it is "THAI CHARACTER KO KAI" (U+0E01), plus 20 occurrences of the non-spacing mark "THAI CHARACTER MAITAIKHU" (U+0E47).


It's odd, but I think the HN title has only 5 of those marks. Does HN code limit the stacking?


Indeed, pasting from the Google input and investigating, it's a single U+0E01 ก (THAI CHARACTER KO KAI) with 20 U+0E47 ็ (THAI CHARACTER MAITAIKHU), a diacritic.
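
For anyone who wants to do the same investigation, here is a short Python sketch with the standard `unicodedata` module. The `describe` helper is just a name I made up; it lists each code point with its official name and general category.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME (category)' line per code point."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')} ({unicodedata.category(ch)})"
        for ch in text
    ]

sample = "\u0e01" + "\u0e47" * 2
for line in describe(sample):
    print(line)
```

Category `Mn` (non-spacing mark) on the second and third lines is what tells you these are diacritics attached to the first character.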


It's definitely an intentional repetition of stacking diacritics. ก็ has a meaning in Thai, but not ก ็ ็ ็


It's 2 characters.


This character used to be (or maybe still is) a very popular way of trolling people on facebook. Flooding chat window with those funny letters seemed to crash the browser after a while.


T̳͉̱ͯͩ͌͐ͮ͜ͅh̆͏̫̫̫̜̫a͇͕̮̘͉ͣͫ̑̀ͭtͩ̀ͪ̇̈ ͕̩͓̺͔ͤ͠w̼̘͒́̓͗o͏͕̱͉̠ủ̥̠̯̫͙͙͖ͧ̿l̮͓̣̣̥͂ͬ͟d̪̦̏ͩ̐͝ ̬̮̳̦̠ͫ̇͠b̴̄́e̮ͯ̇̂͂̚͠ ̱̬̄́̃̏͋̅z̷̰̞̙̼͓ͤ̏̐̈ȁ͍̫̽ͫ̌͐͌̆l̦͔̐̇ͧ̐̎͝ǧ̢̜̱ͯ͌ö̳̐ͤ͗̍̇ͅ

That would be zalgo: http://eeemo.net/


That doesn't render 100% for me, lots of square boxes.

http://knowyourmeme.com/memes/zalgo


You are missing out. http://i.imgur.com/8pPNeKh.png


What are you using? On Firefox on Linux I get this: http://i.imgur.com/Xspttp3.png


That's really fascinating that it results in different output depending on your browser.


Strange, mine was from Safari on OS X.


Switch the page encoding to Thai.


HE COMES, HE COMES!


Once had this dude sign up for our page management... we had a lot of assumptions about plain or at least sane text that had to get updated. https://www.facebook.com/glitchr


That is hilarious. Can't scroll down without it crashing the tab.


Works fine on Chrome 25.0.1364.97 on Ubuntu 12.10. Is this a browser issue or something with the underlying platform font rendering libs ?


The problem seems to be restricted to Windows - apparently it looks normal on Ubuntu and Mac. (Here: Mac 10.6, works fine on Chrome, Firefox and Safari.) So I guess it's a platform issue.


Crashes both Safari 6.0.1 and Chrome 25.0.1364.155 on OSX 10.8.2 over here.


Chrome on Win7 64-bit here; it does not crash as I scroll down.


Same here. Crashes in Chrome. I wonder if that could be exploited...


There's a twitter account, glitchr which posts lots of these things: https://twitter.com/glitchr_

glitchr's tweets may cause other twitter clients to crash too, eg this one (you have been warned!) https://twitter.com/joshlogan42/status/303975029698342912


Posted a million times already but timeless: http://stackoverflow.com/a/1732454


In cases like this it is useful to post a short description of the link. Like this: http://stackoverflow.com/a/1732454 (the famous "don't parse HTML with regex" cry for sanity, semi-relevant because of zalgo text.) FTFY


What on earth is going on here? 🔴҈҈҈҈̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚҉̚


Would you mind indenting that 2 spaces on its own line so it doesn't stretch the page out? (I see all square boxes, sadly, and it's pretty wide)


That's a pretty fun one. It's a sequence of the following codepoints: {LARGE RED CIRCLE} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} {COMBINING CYRILLIC HUNDRED THOUSANDS SIGN} and then 66 repetitions of {COMBINING LEFT ANGLE ABOVE} {COMBINING CYRILLIC MILLIONS SIGN}


Please add some spaces in the middle of that long line.

HN is now not wrapping long lines, because your unbroken long line has widened the margins.


It looks like this to me: http://i.imgur.com/qjKCL62.png


I don't think it's supposed to be a long line. Here is how it shows here http://imgb.mp/i3b.jpg


It's interesting to compare this across browsers http://imgur.com/a/0jxjj


I guess Opera is just really, really boring [0,1].

[0] http://imgur.com/joMTxLm

[1] http://imgur.com/TEmOTPo


Huh - my Chrome renders it like a photon out of a Feynman diagram.


How is this character rendered in SimCity?


It's not


Funny you mention this. We are working on support for this right now (today). EAWebKit, used by SimCity, didn't originally have Thai support, and it's being implemented now. I can tell you what it will look like though, as I just took a screenshot: http://i.imgur.com/ZFtGP87.png. It's conventional that repeated Thai decorators stick with the base glyph, though that character is invalid Thai. We might fix it nevertheless.


Hmm, that's interesting. At work I looked at this on my Ubuntu machine with the Chromium browser. It didn't look particularly special because all the diacritic marks were drawn on top of each other in the same location above the letter.

I come home and look at it on my Windows machine with Chrome, and now I see the big stack of diacritic marks that I assume everyone's making a fuss about. I assume it's something to do with the way that the system's installed font lays out the marks in question.


Since this page is already full of weird unicode stuff, I may as well show off this fun tool that I wrote.

𝑀𝑎𝑛𝑦 𝑝𝑒𝑜𝑝𝑙𝑒 𝑤𝑜𝑛'𝑡 𝑏𝑒 𝑎𝑏𝑙𝑒 𝑡𝑜 𝑠𝑒𝑒 𝑡ℎ𝑖𝑠, 𝑜𝑟 𝑎𝑡 𝑙𝑒𝑎𝑠𝑡 𝑎𝑙𝑙 𝑜𝑓 𝑡ℎ𝑒 𝑐ℎ𝑎𝑟𝑎𝑐𝑡𝑒𝑟𝑠, 𝑠𝑖𝑛𝑐𝑒 𝐼 𝑡ℎ𝑖𝑛𝑘 𝑎 𝑈𝑛𝑖𝑐𝑜𝑑𝑒 6.0 𝑓𝑜𝑛𝑡 𝑖𝑠 𝑟𝑒𝑞𝑢𝑖𝑟𝑒𝑑.

𝔼𝕧𝕖𝕟 𝕗𝕖𝕨𝕖𝕣 𝕗𝕠𝕟𝕥𝕤 𝕙𝕒𝕧𝕖 𝕥𝕙𝕖 𝕗𝕦𝕝𝕝 𝕕𝕠𝕦𝕓𝕝𝕖-𝕤𝕥𝕣𝕦𝕔𝕜 𝕒𝕝𝕡𝕙𝕒𝕓𝕖𝕥, 𝕥𝕙𝕠𝕦𝕘𝕙 𝕚𝕥 𝕨𝕠𝕣𝕜𝕤 𝕗𝕚𝕟𝕖 𝕗𝕠𝕣 𝕞𝕖 𝕠𝕟 𝕆𝕊 𝕏.

http://mar.cx/unicate/



I just clicked on the image results and I'm... confused :/

Also, it seems to not work on all browsers and even then, FF and IE do slightly different things : http://i.imgur.com/hfWu5Bs.png

I'm on Win7.

Edit: I just noticed, on FF, the character spills out of the tab preview text and onto the chrome background as well.


ก็็็็็็็็็็็็็็็็็็็็


I kind of wish the poster had included a screen shot of what happens; what should happen; and what definitely shouldn't happen.

I (OS X; Chrome) get little blobs over the n. That's wrong? But doesn't break the page?


Yeah I was completely baffled by this until I looked in Google Image Search. I ran this on my Windows box and sure enough it looks crazy with little 'springs' going everywhere. Under Chrome, Firefox and Safari on OS X it looks normal though, just foreign text.


ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ 
ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ 
ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้


Your comment forces HN into horizontal scrolling mode. Care to add some spaces in there to not fuck everyone over?


I was going to ask about this a week ago. An "Anonymous" twitter account posted it last week and the letters overlaid 3 or 4 tweets above it. Had no idea what it was.


ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิ


So that's what's going on! I get this in Skype on some Thai account names. Been wondering about this for ages (over a year)


You might be interested in: http://twitter.com/crashtxt


Can we just mark this language as deprecated? It's too complicated to render!


Chrome on OS X doesn't have a huge stack of them coming off the letter...


Oh gawd, there is dirt all over my screen now..


If anyone can figure out a way to strip out those nasty cases while still preserving valid accents in most common languages then please let me know.
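
One heuristic that might get part of the way there, offered as a sketch and not a solution: drop a combining mark when it is identical to the mark immediately before it. This assumes legitimate text never repeats the same mark twice in a row, which I have not verified for every script, and it does nothing against zalgo-style text that alternates different marks. `collapse_repeated_marks` is a hypothetical helper name.

```python
import unicodedata

def collapse_repeated_marks(text):
    """Drop a combining mark when it repeats the character immediately before it."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and out and out[-1] == ch:
            continue  # same mark again: skip it
        out.append(ch)
    return "".join(out)

stacked = "\u0e01" + "\u0e47" * 20
print(len(collapse_repeated_marks(stacked)))  # 2: the base letter plus one mark
```

Stacks of different marks, such as a decomposed Vietnamese vowel with both a circumflex and an acute accent, are left alone.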


Mobile Safari doesn't seem to have this issue.



I take it jokes are never allowed in this club?


Seems all the major sites have this flaw.


ก็็็็็็็็็็็็็็็็็็็็


works fine on my linux, not on my windows ;)


This is awesome.


Opens eyes to unicode for me!


Missingno


Google, you are a company that makes billions of dollars and I'm nobody to tell you what to do, but maybe you should add this to your CSS:

    .st, a {
        display: inline-block;
        overflow: hidden;
    }

Hacker News should do the same thing but with .title, .comment and .comhead


Surprisingly, it's actually the correct way to render that Unicode. It's just a stack of upper diacritics http://jsbin.com/erajer/7/?%E0%B8%81%E0%B9%89%E0%B9%89%E0%B9... as explained in this Stack Overflow answer http://stackoverflow.com/questions/10414864/whats-up-with-th...


Technically "correct", you mean. If there were a Unicode character that filled the whole screen with black, it would not matter that it was technically "correct"; usability is more important.


I disagree.

If stacking diacritics are a legitimate part of a language and they change the meaning of words or characters, then it's more important that the content be correct than the "usability."

In fact, if a person can't read it or reads it incorrectly because parts of characters are hidden, then it's not very usable.

Hypothesizing about a Unicode character that fills the screen with black is a nonsensical straw man, because it makes no sense in "real world" written languages, so there would never be a Unicode character for it.


False; diacritics commonly used in the real-world such as "´¨`" fit inside the same space as the characters, this successive chain of diacritics is never used in real-world texts except for very few obscure cases. Plus the implementation of UTF8 should include the line-height required for the correct displaying of the character if they really believe the displaying of obscure characters is more important than usability.


UTF-8 is an encoding and has no say on how a given character should be rendered.


What is obscure for you is not obscure for someone else.


Are you implying Thai diacritics are not "commonly used in the real world"?


What would that do?


A bunch of stuff, but the important thing here is that it clips things above and below each line so things like this can't overlap other lines


It doesn't actually cap the line. It caps the block. Sometimes the block can extend to multiple lines.

http://i.imgur.com/698dzMo.png

After trying the CSS suggested here on Google in Chrome, the problematic characters no longer extend too high as they did originally, but they get clipped at the top of the first line of the paragraph instead of at the line the character is actually on.

That said, I'm not sure how to solve this though.


Doh, you're right. For some reason I had it in my head that the block itself wrapped onto multiple lines but I must have been thinking of something else.

It seems like there isn't really any CSS only solution for this without wrapping every character in its own element like jQueryIsAwesome suggested.


The point is to avoid overlapping with other elements. If they want to ruin their own description/comment, so be it.

But just for the technical challenge you could do something like this in JavaScript to chop every individual special character:

    string.split("").map(function(a){ 
        return /[a-z0-9\s]/i.test(a) ? a : '<span class="s_char">' + a + '</span>'; 
    }).join("");

And apply the mentioned CSS to the "s_char" class.


Oh god please no.


explain.


It's a huge hammer and a considerably larger webpage footprint just so that a type of character isn't allowed to run amok, something that's not going to happen anyway in the vast majority of cases.


We have very different definitions of "huge hammer". In any decent modern browser (Chrome, Firefox, IE9) it takes less than 10ms to apply the mentioned code to 20 texts (using the linked 10 Google results as an example).


It also puts every single special character into its own span. That may negatively affect programs that trust HTML for copy-paste, for example Open Office, Microsoft Word (I think?), some IM clients, some text editors, etc. Doesn't seem worth it in the general case.

That said, if those sorts of things bother an individual, they could run that on the page themselves, I suppose, so it's good for that :)


I don't know OpenOffice, but in most programs putting spans around a letter or word does not affect pasting.


10ms is a huge amount to spend for such an obscure fix. How many obscure corner cases do you think Google webpages have to account for? A lot more than 100, I would bet. So rough methods like these would be horrible for performance.


I invite you to look at the ridiculous amounts of JS lines twitter uses in its interface.


You're the one who made the 10ms claim, not me. If indeed it is faster than you claimed, then maybe it would be ok. Still probably not worth engineering time for a problem of this magnitude.


And Twitter is horridly slow on its website. That's what we're trying to avoid


It is a huge amount of CPU and memory and wasted resources to correct a problem sometimes and only for some edge cases.


It corrupts Unicode data because it splits on code points instead of grapheme clusters. When performed on 'Spin̈al Tap', it splits the base character U+006E (LATIN SMALL LETTER N) from the combining character U+0308 (COMBINING DIAERESIS) and results in the string 'Spin<span class="s_char">̈</span>al Tap', which contains the valid Unicode grapheme cluster '>̈'! If you were to split on grapheme clusters instead, the result would be 'Spi<span class="s_char">n̈</span>al Tap'. However, I still wouldn't support that solution because it could negatively affect text segmentation used by search engine indexing and natural language processing tools.
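
To illustrate the difference, here is a rough Python rendering of the span-wrapping idea that groups marks with their base character first. The `clusters` helper is a deliberate simplification (it only glues category `M*` characters to whatever precedes them) and not the full grapheme cluster algorithm; both function names are mine.

```python
import html
import unicodedata

def clusters(text):
    """Approximate grapheme clusters: glue each combining mark to the character before it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def wrap_special(text):
    """Wrap every cluster that is not a single ASCII letter, digit or whitespace in a span."""
    parts = []
    for cluster in clusters(text):
        plain = len(cluster) == 1 and cluster.isascii() and (cluster.isalnum() or cluster.isspace())
        parts.append(cluster if plain else '<span class="s_char">' + html.escape(cluster) + "</span>")
    return "".join(parts)

print(wrap_special("Spin\u0308al Tap"))
```

With the marks kept attached, the n and its diaeresis land inside the same span, which matches the grapheme-cluster result described above.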


See that: yet another issue with characters in filenames/directory names which aren't printable ASCII characters.

This kind of stuff is precisely the reason why we make sure that every filename we create is only using a subset of ASCII (and no space of course). In our source code, in our builds, in the desktop app we're serving, etc.

Unicode characters entered by users should go in one place: the DB.

I smiled the other day when I read about the build script for Chromium: it clearly specified that the source directory must not contain any space in its name.

Of course it shouldn't. That's experience.



